Abstract

We consider a contextual version of the multi-armed bandit problem with global knapsack constraints. In each round, the outcome of pulling an arm is a scalar reward and a resource consumption vector, both dependent on the context, and the global knapsack constraints require the total consumption for each resource to be below some pre-fixed budget. The learning agent competes with an arbitrary set of context-dependent policies. This problem was introduced by Badanidiyuru et al. (2014), who gave a computationally inefficient algorithm with near-optimal regret bounds for it. We give a computationally efficient algorithm for this problem with slightly better regret bounds, by generalizing the approach of Agarwal et al. (2014) for the non-constrained version of the problem. The computational time of our algorithm scales logarithmically in the size of the policy space. This answers the main open question of Badanidiyuru et al. (2014). We also extend our results to a variant where there are no knapsack constraints but the objective is an arbitrary Lipschitz concave function of the sum of outcome vectors.


1 Introduction 


Multi-armed bandits (e.g., Bubeck and Cesa-Bianchi (2012)) are a classic model for studying the exploration-exploitation tradeoff faced by a decision-making agent, which learns to maximize cumulative reward through sequential experimentation in an initially unknown environment. The contextual bandit problem (Langford and Zhang, 2008), also known as associative reinforcement learning (Barto and Anandan, 1985), generalizes multi-armed bandits by allowing the agent to take actions based on contextual information; in every round, the agent observes the current context, takes an action, and observes a reward that is a random variable with distribution conditioned on the context and the taken action. Despite many recent advances and successful applications of bandits, one of the major limitations of the standard setting is the lack of “global” constraints that are common in many important real-world applications. For example, actions taken by a robot arm may have different levels of power consumption, and the total power consumed by the arm is limited by the capacity of its battery. In online advertising, each advertiser has her own budget, so
that her advertisement cannot be shown more than a certain number of times. In dynamic pricing, there are a certain 
number of objects for sale and the seller offers prices to a sequence of buyers with the goal of maximizing revenue, 
but the number of sales is limited by the supply. 

Recently, a few papers started to address this limitation by considering very special cases such as a single resource [...] ([...], 2004; Tran-Thanh et al., 2010; Tran-Thanh et al., 2012), and application-specific bandit problems such as the ones motivated by online advertising (Chakrabarti and Vee, 2012; Pandey and Olston, 2006), dynamic pricing (Babaioff et al., 2015; Besbes and Zeevi, 2009) and crowdsourcing (Badanidiyuru et al., 2012; Singla and Krause, [...]). [...] most previous formulations. In this problem, which they called Bandits with Knapsacks (BwK), there are $d$ different resources, each with a pre-specified budget. Each action taken by the agent results in a $d$-dimensional resource consumption vector, in addition to the regular (scalar) reward. The goal of the agent is to maximize the total reward, while keeping the cumulative resource consumption below the budget. The BwK model was further generalized to the BwCR (Bandits with convex Constraints and concave Rewards) model by Agrawal and Devanur (2014), which allows for arbitrary concave objective and convex constraints on the sum of the resource consumption vectors in all rounds. Both papers adapted the popular Upper Confidence Bound (UCB) technique to obtain near-optimal regret guarantees. However, the focus was on the non-contextual setting.


There has been significant recent progress (Agarwal et al., 2014; Dudík et al., 2011) in algorithms for general (instead of linear (Abbasi-Yadkori et al., 2012; Chu et al., 2011)) contextual bandits where the context and reward can have arbitrary correlation, and the algorithm competes with some arbitrary set of context-dependent policies. Dudík et al. (2011) achieved the optimal regret bound for this remarkably general contextual bandits problem, assuming access to the policy set only through a linear optimization oracle, instead of explicit enumeration of all policies as in previous work (Auer et al., 2002; Beygelzimer et al., 2011). However, the algorithm presented in Dudík et al. (2011) was not tractable in practice, as it makes too many calls to the optimization oracle. Agarwal et al. (2014) presented a simpler and computationally efficient algorithm, with a running time that scales as the square-root of the logarithm of the policy space size, and achieves an optimal regret bound.

Combining contexts and resource constraints, Agrawal and Devanur (2014) also considered a static linear contextual version of BwCR where the expected reward was linear in the context.¹ Wu et al. (2015) considered the special case of random linear contextual bandits with a single budget constraint, and gave near-optimal regret guarantees for it. Badanidiyuru et al. (2014) extended the general contextual version of bandits with arbitrary policy sets to allow budget constraints, thus obtaining a contextual version of BwK, a problem they called Resourceful Contextual Bandits (RCB). We will refer to this problem as CBwK (Contextual Bandits with Knapsacks), to be consistent with the naming of related problems defined in the paper. They gave a computationally inefficient algorithm, based on Dudík et al. (2011), with a regret that was optimal in most regimes. Their algorithm was defined as a mapping from the history and the context to an action, but the computational issue of finding this mapping was not addressed. They posed an open question of achieving computational efficiency while maintaining a similar or even a sub-optimal regret.


Main Contributions. In this paper, we present a simple and computationally efficient algorithm for CBwK/RCB, based on the algorithm of Agarwal et al. (2014). Similar to Agarwal et al. (2014), the running time of our algorithm scales as the square-root of the logarithm of the size of the policy set,² thus resolving the main open question posed by Badanidiyuru et al. (2014). Our algorithm even improves the regret bound of Badanidiyuru et al. (2014) by a factor of $\sqrt{d}$. Another improvement over Badanidiyuru et al. (2014) is that while they need to know the marginal distribution of contexts, our algorithm does not. A key feature of our techniques is that we need to modify the algorithm in Agarwal et al. (2014) in a very minimal way, in an almost blackbox fashion, thus retaining the structural simplicity of the algorithm while obtaining substantially more general results.

We extend our algorithm to a variant of the problem, which we call Contextual Bandits with concave Rewards 
(CBwR); in every round, the agent observes a context, takes one of K actions and then observes a d-dimensional 
outcome vector, and the goal is to maximize an arbitrary Lipschitz concave function of the average of the outcome 
vectors; there are no constraints. This allows for many more interesting applications, some of which were discussed in Agrawal and Devanur (2014). This setting is also substantially more general than the contextual version considered in Agrawal and Devanur (2014), where the context was fixed and the dependence was assumed to be linear.


Organization. In Section 2, we define the CBwK problem and state our regret bound as Theorem 1. The algorithm
is detailed in Section 3, and an overview of the regret analysis is in Section 4. In Section 5, we present CBwR, the
problem with concave rewards, state the guaranteed regret bounds, and outline the differences in the algorithm and the 
analysis. Complete proofs and other details are provided in the appendices. 

¹ In particular, each arm is associated with a fixed vector and the resulting outcomes for this arm have expected value linear in this vector.

² Access to the policy set is via an “arg max oracle”, as in Agarwal et al. (2014).

2 Preliminaries and Main Results


CBwK. The CBwK problem was introduced by Badanidiyuru et al. (2014), under the name of Resourceful Contextual Bandits (RCB). We now define this problem.

Let $A$ be a finite set of $K$ actions and $X$ be a space of possible contexts (the analogue of a feature space in supervised learning). To begin with, the algorithm is given a budget $B \in \mathbb{R}_+$.³ We then proceed in rounds: in every round $t \in [T]$, the algorithm observes context $x_t \in X$, chooses an action $a_t \in A$, and observes a reward $r_t(a_t) \in [0,1]$ and a $d$-dimensional consumption vector $\mathbf{v}_t(a_t) \in [0,1]^d$. The objective is to take actions that maximize the total reward, $\sum_{t=1}^{T} r_t(a_t)$, while making sure that the consumption does not exceed the budget, i.e., $\sum_{t=1}^{T} \mathbf{v}_t(a_t) \le B\mathbf{1}$.

The algorithm stops either after $T$ rounds or when the budget is exceeded in one of the dimensions, whichever occurs first. We assume that one of the actions is a “no-op” action, i.e., it always gives a reward of 0 and a consumption vector of all 0s. Furthermore, we make a stochastic assumption that the context, the reward, and the consumption vectors $(x_t, \{r_t(a), \mathbf{v}_t(a) : a \in A\})$ for $t = 1, 2, \ldots, T$ are drawn i.i.d. (independent and identically distributed) from a distribution $\mathcal{D}$ over $X \times [0,1]^{A} \times [0,1]^{d \times A}$. The distribution $\mathcal{D}$ is unknown to the algorithm.


Policy Set. Following previous work (Agarwal et al., 2014; Badanidiyuru et al., 2014; Dudík et al., 2011), our algorithms compete with an arbitrary set of policies. Let $\Pi \subseteq A^X$ be a finite set of policies⁴ that map contexts $x \in X$ to actions $a \in A$. We assume that the policy set contains a “no-op” policy that always selects the no-op action regardless of the context. With global constraints, distributions over policies in $\Pi$ could be strictly more powerful than any policy in $\Pi$ itself.⁵ Our algorithms compete with this more powerful set, which is a stronger guarantee than simply competing with fixed policies in $\Pi$. For this purpose, define $C(\Pi) := \{P \in [0,1]^{\Pi} : \sum_{\pi \in \Pi} P(\pi) = 1\}$ as the set of all convex combinations of policies in $\Pi$. For a context $x \in X$, choosing actions with $P \in C(\Pi)$ is equivalent to following a randomized policy that selects action $a \in A$ with probability $P(a|x) = \sum_{\pi \in \Pi : \pi(x) = a} P(\pi)$; we therefore also refer to $P$ as a (mixed) policy. Similarly, define $C_0(\Pi) := \{P \in [0,1]^{\Pi} : \sum_{\pi \in \Pi} P(\pi) \le 1\}$ as the set of all non-negative weights over $\Pi$, which sum to at most 1. Clearly, $C(\Pi) \subseteq C_0(\Pi)$.
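The action probabilities $P(a|x)$ induced by a mixed policy can be computed by summing the weights of all policies that pick $a$ in context $x$. The sketch below illustrates this on a toy policy set; the policy names, contexts, and weights are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def action_probabilities(weights, policies, x):
    """Return P(a|x): the total weight of policies pi with pi(x) == a."""
    probs = defaultdict(float)
    for name, w in weights.items():
        probs[policies[name](x)] += w
    return dict(probs)

# Hypothetical toy policy set: contexts in {0, 1}, actions {"noop", "a", "b"}.
policies = {
    "noop": lambda x: "noop",
    "always_a": lambda x: "a",
    "context_split": lambda x: "a" if x == 0 else "b",
}
# Weights sum to 1, so this is an element of C(Pi).
weights = {"noop": 0.2, "always_a": 0.3, "context_split": 0.5}

p0 = action_probabilities(weights, policies, 0)  # distribution over actions at x = 0
p1 = action_probabilities(weights, policies, 1)  # distribution over actions at x = 1
```

If the weights summed to less than 1, the same function would describe an element of $C_0(\Pi)$, and the returned values would sum to less than 1 as well.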


Benchmark and Regret. The benchmark for this problem is an optimal static mixed policy, where the budgets are required to be satisfied in expectation only. Let $R(P) := \mathbb{E}_{(x,r,\mathbf{v}) \sim \mathcal{D}}[\mathbb{E}_{\pi \sim P}[r(\pi(x))]]$ and $\mathbf{V}(P) := \mathbb{E}_{(x,r,\mathbf{v}) \sim \mathcal{D}}[\mathbb{E}_{\pi \sim P}[\mathbf{v}(\pi(x))]]$ denote respectively the expected reward and consumption vector for policy $P \in C(\Pi)$. We call a policy $P \in C(\Pi)$ a feasible policy if $T\mathbf{V}(P) \le B\mathbf{1}$. Note that there always exists a feasible policy in $C(\Pi)$, because of the no-op policy. Define an optimal policy $P^* \in C(\Pi)$ as a feasible policy that maximizes the expected reward:

$$P^* = \arg\max_{P \in C(\Pi)} T R(P) \quad \text{s.t.} \quad T\mathbf{V}(P) \le B\mathbf{1}. \tag{1}$$

The reward of this optimal policy is denoted by $\mathrm{OPT} := T R(P^*)$. We are interested in minimizing the regret, defined as

$$\mathrm{regret}(T) := \mathrm{OPT} - \sum_{t=1}^{T} r_t(a_t). \tag{2}$$
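To see why the benchmark is taken over mixed policies, consider the two-policy example from the footnote on mixtures: both policies give reward 1 per round, and each consumes 1 unit per round of a different resource. The sketch below is a small grid search, not the algorithm of this paper; the function name and the values of `T` and `B` are assumptions made for illustration.

```python
def best_mixture(T, B, steps=100):
    """Grid-search weights (w1, w2) on two unit-reward policies.

    The remaining weight 1 - w1 - w2 goes to the no-op policy.
    Policy 1 consumes one unit of resource 1 per round, policy 2 one unit
    of resource 2 per round, so feasibility means T*w1 <= B and T*w2 <= B.
    """
    best = (0.0, (0.0, 0.0))
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w1, w2 = i / steps, j / steps
            if T * w1 <= B and T * w2 <= B:
                best = max(best, (T * (w1 + w2), (w1, w2)))
    return best

T, B = 100, 25  # hypothetical horizon and per-resource budget
opt_mixed, (w1, w2) = best_mixture(T, B)

# Best value when only one of the two policies may be mixed with no-op.
opt_single = max(T * (i / 100) for i in range(101) if T * (i / 100) <= B)
```

With these numbers the mixture of both policies earns twice the expected reward of any mixture that uses only one of them, in line with the footnote.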


AMO. Since the policy set $\Pi$ is extremely large in most interesting applications, accessing it by explicit enumeration is impractical. For the purpose of efficient implementation, we instead only access $\Pi$ via a maximization oracle. Employing such an oracle is common when considering contextual bandits with an arbitrary set of policies (Agarwal et al., 2014; Dudík et al., 2011; Langford and Zhang, 2008). Following previous work, we call this oracle an “arg max oracle”, or AMO.


³ More generally, different dimensions could have different budgets, but this formulation is without loss of generality: scale the units of all dimensions so that all the budgets are equal to the smallest one. This preserves the requirement that the vectors are in $[0,1]^d$.

⁴ The policies may be randomized in general, but for our results, we may assume without loss of generality that they are deterministic. As observed by Badanidiyuru et al. (2014), we may replace randomized policies with deterministic policies by appending a random seed to the context. This blows up the size of the context space, which does not appear in our regret bounds.

⁵ E.g., consider two policies that both give reward 1, but each consume 1 unit of a different resource. The optimum solution is to mix uniformly between the two, which does twice as well as using any single policy.

Definition 1. For a set of policies $\Pi$, the arg max oracle (AMO) is an algorithm, which for any sequence of contexts and rewards, $(x_1, r_1), \ldots, (x_t, r_t) \in X \times [0,1]^{A}$, returns

$$\arg\max_{\pi \in \Pi} \sum_{\tau=1}^{t} r_\tau(\pi(x_\tau)). \tag{3}$$
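For intuition, the AMO can be pictured as the following brute-force routine over an explicitly enumerated policy set. This is only a sketch: the paper assumes the oracle is available as a black box precisely because enumeration is impractical, and the policies and data below are hypothetical.

```python
def arg_max_oracle(policies, data):
    """Brute-force AMO.

    data is a list of (context, reward_by_action) pairs; the oracle returns
    the policy with the largest total reward on this sequence.
    """
    def total_reward(pi):
        return sum(rewards[pi(x)] for x, rewards in data)
    return max(policies, key=total_reward)

def always_a(x):
    return "a"

def always_b(x):
    return "b"

def split(x):
    return "a" if x == 0 else "b"

# Hypothetical sequence (x_1, r_1), (x_2, r_2), (x_3, r_3).
data = [
    (0, {"a": 1.0, "b": 0.0}),
    (1, {"a": 0.2, "b": 0.9}),
    (0, {"a": 0.7, "b": 0.1}),
]
best = arg_max_oracle([always_a, always_b, split], data)
```

Here `split` collects 1.0 + 0.9 + 0.7, more than either constant policy, so it is the one returned.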


Main Results. Our main result is a computationally efficient low-regret algorithm for CBwK. Furthermore, we improve the regret bound of Badanidiyuru et al. (2014) by a $\sqrt{d}$ factor; they present a detailed discussion on the optimality of the dependence on $K$ and $T$ in this bound.

Theorem 1. For the CBwK problem, $\forall \delta > 0$, there is a polynomial-time algorithm that makes $\tilde{O}(d\sqrt{KT \ln(|\Pi|)})$ calls to AMO, and with probability at least $1 - \delta$ has regret

$$\mathrm{regret}(T) = O\left(\left(\frac{\mathrm{OPT}}{B} + 1\right)\sqrt{KT \ln(dT|\Pi|/\delta)}\right).$$

Note that the above regret bound is meaningful only for $B > \Omega(\sqrt{KT \ln(dT|\Pi|/\delta)})$; therefore, in the rest of the paper we assume that $B > c'\sqrt{KT \ln(dT|\Pi|/\delta)}$ for some large enough constant $c'$. We also extend our results to a version with a concave reward function, as outlined in Section 5. For the rest of the paper, we treat $\delta > 0$ as fixed, and define quantities that depend on $\delta$.


3 Algorithm for the CBwK problem 


From previous work on multi-armed bandits, we know that the key challenges in finding the “right” policy are that (1) it should concentrate fast enough on the empirically best policy (based on data observed so far), (2) the probability of choosing an action must be large enough to enable sufficient exploration, and (3) it should be efficiently computable. Agarwal et al. (2014) show that all these can be addressed by solving a properly defined optimization problem, with the help of an AMO. We have the additional technical challenge of dealing with global constraints. As mentioned earlier, one complication that arises right away is that due to the knapsack constraints, the algorithm has to compete against the best mixed policy in $\Pi$, rather than the best pure policy. In the following, we will highlight the main technical difficulties we encounter, and our solution to these difficulties.

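Some definitions are in place before we describe the algorithm. Let $H_t$ denote the history of chosen actions and observations before time $t$, consisting of records of the form $(x_\tau, a_\tau, r_\tau(a_\tau), \mathbf{v}_\tau(a_\tau), p_\tau(a_\tau))$, where $x_\tau, a_\tau, r_\tau(a_\tau), \mathbf{v}_\tau(a_\tau)$ denote, respectively, the context, action taken, reward and consumption vector observed at time $\tau$, and $p_\tau(a_\tau)$ denotes the probability at which action $a_\tau$ was taken. (Recall that our algorithm selects actions in a randomized way using a mixed policy.) Although $H_t$ contains observation vectors only for chosen actions, it can be “completed” using the trick of importance sampling: for every $(x_\tau, a_\tau, r_\tau(a_\tau), \mathbf{v}_\tau(a_\tau), p_\tau(a_\tau)) \in H_t$, define the fictitious observation vectors $\hat{r}_\tau \in [0,1]^{A}$, $\hat{\mathbf{v}}_\tau \in [0,1]^{d \times A}$ by:

$$\hat{r}_\tau(a) := \frac{r_\tau(a_\tau)}{p_\tau(a_\tau)} \cdot I\{a_\tau = a\}, \qquad \hat{\mathbf{v}}_\tau(a) := \frac{\mathbf{v}_\tau(a_\tau)}{p_\tau(a_\tau)} \cdot I\{a_\tau = a\}.$$

Clearly, $\hat{r}_\tau, \hat{\mathbf{v}}_\tau$ are unbiased estimators of $r_\tau, \mathbf{v}_\tau$: for every $a$, $\mathbb{E}_{a_\tau}[\hat{r}_\tau(a)] = r_\tau(a)$, $\mathbb{E}_{a_\tau}[\hat{\mathbf{v}}_\tau(a)] = \mathbf{v}_\tau(a)$, where the expectations are over randomization in selecting $a_\tau$.

With the “completed” history, it is straightforward to obtain an unbiased estimate of the expected reward vector and expected consumption vector for every policy $P \in C(\Pi)$:

$$\hat{R}_t(P) := \hat{\mathbb{E}}_{\tau \sim [t]}\left[\mathbb{E}_{\pi \sim P}[\hat{r}_\tau(\pi(x_\tau))]\right],$$

$$\hat{\mathbf{V}}_t(P) := \hat{\mathbb{E}}_{\tau \sim [t]}\left[\mathbb{E}_{\pi \sim P}[\hat{\mathbf{v}}_\tau(\pi(x_\tau))]\right].$$

The convenient notation $\hat{\mathbb{E}}_{\tau \sim [t]}[\cdot]$ above, indicating that $\tau$ is drawn uniformly at random from the set of integers $\{1, 2, \ldots, t\}$, simply means averaging over time up to step $t$. It is easy to verify that $\mathbb{E}[\hat{R}_t(P)] = R(P)$, and $\mathbb{E}[\hat{\mathbf{V}}_t(P)] = \mathbf{V}(P)$.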
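The importance-sampling completion can be checked numerically: the fictitious reward vector puts $r_\tau(a_\tau)/p_\tau(a_\tau)$ on the action that was taken and 0 elsewhere, and its expectation over the random choice of $a_\tau$ recovers the true reward of every action. The sketch below computes that expectation exactly for one round; the reward values and selection probabilities are hypothetical, and the same construction applies coordinate-wise to consumption vectors.

```python
def ips_estimate(reward_taken, prob_taken, action_taken, actions):
    """Fictitious reward vector: r(a_taken)/p(a_taken) on the taken action, 0 elsewhere."""
    return {a: (reward_taken / prob_taken if a == action_taken else 0.0)
            for a in actions}

actions = ["noop", "a", "b"]
true_rewards = {"noop": 0.0, "a": 0.6, "b": 0.9}  # hypothetical r_tau(a)
probs = {"noop": 0.2, "a": 0.5, "b": 0.3}         # hypothetical p_tau(a), all positive

# Exact expectation over the random choice of the taken action.
expected = {a: 0.0 for a in actions}
for taken in actions:
    estimate = ips_estimate(true_rewards[taken], probs[taken], taken, actions)
    for a in actions:
        expected[a] += probs[taken] * estimate[a]
```

Unbiasedness relies on every action having positive selection probability, which is what the minimum probability $\mu$ in the sampling step guarantees.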
Given these estimates, we construct an optimization problem (OP) which aims to find a mixed policy that has a small “empirical regret”, and at the same time provides sufficient exploration over “good” policies. The optimization problem uses a quantity $\widehat{\mathrm{Reg}}_t(P)$, “the empirical regret of policy $P$”, to characterize good policies. Agarwal et al. (2014) define $\widehat{\mathrm{Reg}}_t(P)$ as simply the difference between the empirical reward estimate of policy $P$ and that of the policy with the highest empirical reward. Thus, good policies were characterized as those with high reward. For our problem, however, a policy could have a high reward while its consumption violates the knapsack constraints by a large margin. Such a policy should not be considered a good policy. A key challenge in this problem is therefore to define a single quantity that captures the “goodness” of a policy by appropriately combining rewards and consumption vectors.

We define quantities $\mathrm{Reg}(P)$ (and the corresponding empirical estimate $\widehat{\mathrm{Reg}}_t(P)$ up to round $t$) of $P \in C(\Pi)$ by combining the regret in reward and constraint violation using a multiplier “$Z$”. The multiplier captures the sensitivity of the problem to violation in knapsack constraints. It is easy to observe from (1) that increasing the knapsack size from $B$ to $(1 + \epsilon)B$ can increase the optimal value to at most $(1 + \epsilon)\mathrm{OPT}$. It follows that if a policy violates any knapsack constraint by $\gamma$, it can achieve at most $\frac{\mathrm{OPT}}{B}\gamma$ more reward than $\mathrm{OPT}$. More precisely,

Lemma 2. For any $b$, let $\mathrm{OPT}(b)$ denote the value of an optimal solution of (1) when the budget is set as $b$. Then, for any $b > 0$, $\gamma \ge 0$,

$$\mathrm{OPT}(b + \gamma) \le \mathrm{OPT}(b) + \frac{\mathrm{OPT}(b)}{b}\gamma. \tag{4}$$
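One way to see the inequality in Lemma 2 is a scaling argument that mixes an optimal policy with the no-op policy. The derivation below is a sketch under the assumptions stated in its comments (in particular $b > 0$, and $P_0$ as a name for the no-op policy, which is notation introduced here); it is not necessarily the proof used by the authors.

```latex
% Assumptions: OPT(b) is the optimal value of (1) over C(\Pi) with budget b > 0,
% and P_0 denotes the no-op policy (zero reward, zero consumption).
% R and V are linear in the mixture weights.
\begin{align*}
&\text{Let } P \text{ be optimal for budget } b+\gamma, \text{ and set }
  \tilde P := \tfrac{b}{b+\gamma}\,P + \tfrac{\gamma}{b+\gamma}\,P_0. \\
&T\,\mathbf{V}(\tilde P) = \tfrac{b}{b+\gamma}\,T\,\mathbf{V}(P)
  \le \tfrac{b}{b+\gamma}\,(b+\gamma)\,\mathbf{1} = b\,\mathbf{1}
  \quad\Rightarrow\quad \tilde P \text{ is feasible for budget } b, \\
&\mathrm{OPT}(b) \ge T\,R(\tilde P) = \tfrac{b}{b+\gamma}\,\mathrm{OPT}(b+\gamma)
  \quad\Rightarrow\quad
  \mathrm{OPT}(b+\gamma) \le \Bigl(1+\tfrac{\gamma}{b}\Bigr)\mathrm{OPT}(b)
  = \mathrm{OPT}(b) + \tfrac{\mathrm{OPT}(b)}{b}\,\gamma .
\end{align*}
```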

We use this observation to set $Z$ as an estimate of $\mathrm{OPT}/B$. We do this by using the outcomes of the first

$$T_0 := \frac{12KT}{B}\ln\frac{d|\Pi|}{\delta}$$

rounds, during which we do pure exploration (i.e., play an action in $A$ uniformly at random). For notational convenience, in our algorithm description we will index these initial $T_0$ exploration rounds as $t = -(T_0 - 1), -(T_0 - 2), \ldots, 0$, so that the major component of the algorithm can be started from $t = 1$ and runs until $t = T - T_0$. The following lemma provides a bound on the $Z$ that we estimate. Its proof appears in Appendix B.

Lemma 3. For any $B$, using the first $T_0 = \frac{12KT}{B}\ln\frac{d|\Pi|}{\delta}$ rounds of pure exploration, one can compute a quantity $Z$ such that with probability at least $1 - \delta$,

$$\max\left\{\frac{4\,\mathrm{OPT}}{B},\, 1\right\} \le Z \le \frac{24\,\mathrm{OPT}}{B} + O(1).$$

Now, to define $\mathrm{Reg}(P)$ and $\widehat{\mathrm{Reg}}_t(P)$, we combine regret in reward and constraint violation using the constant $Z$ as computed above. In these definitions, we use a smaller budget amount

$$B' := B - T_0 - c\sqrt{KT\ln(T|\Pi|/\delta)},$$

for a large enough constant $c$ to be specified later. Here, the budget needed to be decreased by $T_0$ to account for the budget consumed in the first $T_0$ exploration rounds. We use a further smaller budget amount to ensure that with high probability ($1 - \delta$) our algorithm will not abort before the end of the time horizon ($T - T_0$), due to budget violation. For any vector $\mathbf{v} \in \mathbb{R}^d$, let $\phi(\mathbf{v}, B')$ denote the amount by which the vector $\mathbf{v}$ violates the budget $B'$, i.e.,

$$\phi(\mathbf{v}, B') := \max_{j=1,\ldots,d}\left(v_j - \frac{B'}{T}\right)^+.$$
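The violation measure compares a per-round consumption vector against the per-round budget $B'/T$ and reports the worst excess over all $d$ resources, clipped at zero. A minimal sketch, assuming this reading of the definition of $\phi$; the function name and the numbers are illustrative only.

```python
def budget_violation(v, budget, horizon):
    """phi(v, B') = max_j (v_j - B'/T)^+ for a per-round consumption vector v."""
    per_round = budget / horizon
    return max(max(vj - per_round, 0.0) for vj in v)

# Hypothetical numbers: B' = 50 and T = 200 give a per-round budget of 0.25.
phi_ok = budget_violation([0.10, 0.20], budget=50, horizon=200)   # within budget
phi_bad = budget_violation([0.10, 0.40], budget=50, horizon=200)  # resource 2 exceeds it
```

A feasible policy for budget $B'$ has zero violation, which is why $\phi$ vanishes on the optimal policy for $B'$.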

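Let $P'$ denote the optimal policy when the budget amount is $B'$, i.e.,

$$P' := \arg\max_{P \in C(\Pi)} T R(P) \quad \text{s.t.} \quad T\mathbf{V}(P) \le B'\mathbf{1}.$$

And, let $P_t$ denote the empirically optimal policy for the combination of reward and budget violation, defined as:

$$P_t := \arg\max_{P \in C(\Pi)} \hat{R}_t(P) - Z\phi(\hat{\mathbf{V}}_t(P), B'). \tag{5}$$

We define

$$\mathrm{Reg}(P) := \frac{1}{Z+1}\left(R(P') - R(P) + Z\phi(\mathbf{V}(P), B')\right),$$

$$\widehat{\mathrm{Reg}}_t(P) := \frac{1}{Z+1}\left[\hat{R}_t(P_t) - Z\phi(\hat{\mathbf{V}}_t(P_t), B') - \left(\hat{R}_t(P) - Z\phi(\hat{\mathbf{V}}_t(P), B')\right)\right].$$

Note that $\mathrm{Reg}(P') = 0$ and $\widehat{\mathrm{Reg}}_t(P_t) = 0$ by definition.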

We are now ready to describe the optimization problem, (OP). This is essentially the same as the optimization problem solved in Agarwal et al. (2014), except for the new definition of $\widehat{\mathrm{Reg}}_t(P)$, which was described above. It aims to find a mixed policy $Q \in C_0(\Pi)$. This is equivalent to finding a $Q' \in C(\Pi)$ and an $\alpha \in [0,1]$, and returning $Q = \alpha Q'$. Let $Q^{\mu}$ denote a smoothed projection of $Q$, assigning minimum probability $\mu$ to every action: $Q^{\mu}(a|x) := (1 - K\mu)Q(a|x) + \mu$. (OP) depends on the history up to some time $t$, and a parameter $\mu_m$ that will be set by the algorithm. In the rest of the paper, for convenience, we define a constant $\psi := 100$.

Optimization Problem (OP)

Given: $H_t$, $\mu_m$, and $\psi$.

Let $b_P := \frac{\widehat{\mathrm{Reg}}_t(P)}{\psi\mu_m}$, $\forall P \in C(\Pi)$.

Find a $Q' \in C(\Pi)$, and an $\alpha \in [0,1]$, such that the following inequalities hold. Let $Q = \alpha Q'$.

$$\alpha\, b_{Q'} \le 2K,$$

$$\forall P \in C(\Pi): \quad \hat{\mathbb{E}}_{\tau \sim [t]}\,\mathbb{E}_{\pi \sim P}\left[\frac{1}{Q^{\mu_m}(\pi(x_\tau)|x_\tau)}\right] \le b_P + 2K.$$

The first constraint in (OP) is to ensure that, under $Q$, $\widehat{\mathrm{Reg}}_t$ is “small”. In the second constraint, the left-hand side, as shown in the analysis, is an upper bound on the variance of the estimates $\hat{R}_t(P)$, $\hat{\mathbf{V}}_t(P)$. These two constraints are critical for deriving the regret bound in Section 4. We give an algorithm that efficiently finds a feasible solution to (OP) (and also shows that a feasible solution always exists).

We are now ready to describe the full algorithm, which is summarized in Algorithm 1. The main body of the algorithm shares the same structure as the ILOVETOCONBANDITS algorithm for contextual bandits (Agarwal et al., 2014), with important changes necessary to deal with the knapsack constraints. We use the first $T_0$ rounds to do pure exploration and calculate $Z$ as given by Lemma 3. (These time steps are indexed from $-(T_0 - 1)$ to 0.) The algorithm then proceeds in epochs with pre-defined lengths; epoch $m$ consists of time steps indexed from $\tau_{m-1} + 1$ to $\tau_m$, inclusively. The algorithm can work with any epoch schedule that satisfies $\tau_m < \tau_{m+1} \le 2\tau_m$. Our results hold for the schedule where $\tau_m = 2^m$. However, the algorithm can choose to solve (OP) more frequently than what we use here to get a lower regret (but still within constant factors), at the cost of higher computational time. At the end of an epoch $m$, it computes a mixed policy $Q_m \in C_0(\Pi)$ by solving an instance of (OP), which is then used for the entire next epoch. Additionally, at the end of every epoch $m$, the algorithm computes the empirically best policy $P_{\tau_m}$ as defined in Equation (5), which the algorithm uses as the default policy in the sampling process defined below. $P_0$ can be chosen arbitrarily, e.g., as the uniform policy.

The sampling process, Sample$(x, Q, P, \mu)$ in Step 8, samples an action from the computed mixed policy. It takes the following as input: $x$ (context), $Q \in C_0(\Pi)$ (mixed policy returned by the optimization problem (OP) for the current epoch), $P$ (default mixed policy), and $\mu > 0$ (a scalar for minimum action-selection probability). Since $Q$ may not be a proper distribution (as its weights may sum to a number less than 1), Sample first computes $\tilde{Q} \in C(\Pi)$, by assigning any remaining mass (from $Q$) to the default policy $P$. Then, it picks an action from the smoothed projection of this distribution defined as: $\tilde{Q}^{\mu}(a|x) = (1 - K\mu)\tilde{Q}(a|x) + \mu$, $\forall a \in A$.
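The two steps of the sampling process (completing $Q$ with the default policy, then smoothing so every action has probability at least $\mu$) can be sketched as follows. The function name, the dictionary representation of mixed policies, and the toy policies are assumptions made for illustration; the sketch also assumes $K\mu \le 1$.

```python
import random

def sample_action(x, Q, P_default, mu, actions, rng=random):
    """Sketch of Sample(x, Q, P, mu).

    Q and P_default map policies (callables) to weights; Q sums to at most 1
    and P_default sums to 1.
    """
    K = len(actions)
    leftover = 1.0 - sum(Q.values())
    # Complete Q to a proper distribution: leftover mass goes to the default policy.
    probs = {a: 0.0 for a in actions}
    for pi, w in Q.items():
        probs[pi(x)] += w
    for pi, w in P_default.items():
        probs[pi(x)] += leftover * w
    # Smoothed projection: (1 - K*mu) * Q~(a|x) + mu for every action a.
    smoothed = {a: (1.0 - K * mu) * probs[a] + mu for a in actions}
    action = rng.choices(actions, weights=[smoothed[a] for a in actions])[0]
    return action, smoothed[action], smoothed

def noop(x):
    return "noop"

def always_a(x):
    return "a"

def split(x):
    return "a" if x == 0 else "b"

actions = ["noop", "a", "b"]
Q = {always_a: 0.3, split: 0.4}   # weights sum to 0.7 < 1
P_default = {noop: 1.0}           # hypothetical default policy
action, p, smoothed = sample_action(1, Q, P_default, mu=0.05,
                                    actions=actions, rng=random.Random(0))
```

The returned probability `p` is the quantity recorded in the history as $p_t(a_t)$ and later used as the importance weight.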

The algorithm aborts (in Step 10) if the budget $B$ is consumed for any resource.


3.1 Computation complexity: Solving (OP) using AMO 


Algorithm 1 requires solving (OP) at the end of every epoch. Agarwal et al. (2014) gave an algorithm that solves (OP) using access to the AMO. We use a similar algorithm, except that calls to the AMO are now replaced by calls to a knapsack constrained optimization problem over the empirical distribution. This optimization problem is identical in structure to the optimization problem defining $P_t$ in (5), which we need to solve also. We can solve both of these problems using AMO, as outlined below.
Algorithm 1 Adapted from ILOVETOCONBANDITS

Input: Epoch schedule $0 = \tau_0 < \tau_1 < \tau_2 < \cdots$ such that $\tau_m < \tau_{m+1} \le 2\tau_m$, allowed failure probability $\delta \in (0,1)$.

1: Initialize weights $Q_0 := \mathbf{0} \in C_0(\Pi)$, $P_0 \in C(\Pi)$ and epoch $m := 1$.

Define $\mu_m := \min\left\{\frac{1}{2K}, \sqrt{\ln(16\tau_m^2(d+1)|\Pi|/\delta)/(K\tau_m)}\right\}$ for all $m \ge 0$.

2: for round $t = -(T_0 - 1), \ldots, 0$ do

3: Select action $a_t$ uniformly at random from the set of all arms.

4: end for

5: Compute $Z$ as in Lemma 3.

6: for round $t = 1, 2, \ldots$ do

7: Observe context $x_t \in X$.

8: $(a_t, p_t(a_t)) := \text{Sample}(x_t, Q_{m-1}, P_{\tau_{m-1}}, \mu_{m-1})$.

9: Select action $a_t$ and observe reward $r_t(a_t) \in [0,1]$ and consumption $\mathbf{v}_t(a_t)$.

10: Abort unless $\sum_{\tau=-(T_0-1)}^{t} \mathbf{v}_\tau(a_\tau) < B\mathbf{1}$.

11: if $t = \tau_m$ then

12: Let $Q_m$ be a solution to (OP) with history $H_t$ and minimum probability $\mu_m$.

13: $m := m + 1$.

14: end if

15: end for
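To get a feel for the minimum exploration probability used in Algorithm 1, the sketch below evaluates $\mu_m$ on the doubling schedule $\tau_m = 2^m$. It assumes the reading $\mu_m = \min\{1/(2K), \sqrt{\ln(16\tau_m^2(d+1)|\Pi|/\delta)/(K\tau_m)}\}$ of the definition in Step 1; the function name and the parameter values are hypothetical.

```python
import math

def mu(m, K, d, num_policies, delta):
    """Minimum action probability mu_m on the doubling schedule tau_m = 2^m.

    Assumed form: min{1/(2K), sqrt(ln(16 tau_m^2 (d+1) |Pi| / delta) / (K tau_m))}.
    """
    tau = 2 ** m
    log_term = math.log(16 * tau**2 * (d + 1) * num_policies / delta)
    return min(1.0 / (2 * K), math.sqrt(log_term / (K * tau)))

# Hypothetical problem size: K = 5 actions, d = 2 resources, |Pi| = 1000 policies.
values = [mu(m, K=5, d=2, num_policies=1000, delta=0.05) for m in range(1, 16)]
```

In early epochs the cap $1/(2K)$ is active, so exploration is close to uniform; once $\tau_m$ is large enough, $\mu_m$ decays roughly like $\sqrt{\ln(\tau_m)/\tau_m}$.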


We rewrite (5) as a linear optimization problem where the domain is the intersection of two polytopes. The domain is $[0,1]^{d+2}$; we represent a point in this domain as $(x, \mathbf{y}, \lambda)$, where $x$ and $\lambda$ are scalars and $\mathbf{y}$ is a vector in $d$ dimensions. Let

$$\mathcal{K}_1 := \{(x, \mathbf{y}, \lambda) : x = \hat{R}_t(P),\ \mathbf{y} = \hat{\mathbf{V}}_t(P) \text{ for some } P \in C(\Pi),\ \lambda \in [0,1]\},$$

be the set of all reward, consumption vectors achievable on the empirical outcomes up to time $t$, through some policy in $C(\Pi)$. Let

$$\mathcal{K}_2 := \{(x, \mathbf{y}, \lambda) : \mathbf{y} \le (B'/T + \lambda)\mathbf{1}\} \cap [0,1]^{d+2},$$

be the constraint set, given by relaxing the knapsack constraints by $\lambda$. Now (5) is equivalent to

$$\max\ x - Z\lambda \quad \text{such that} \quad (x, \mathbf{y}, \lambda) \in \mathcal{K}_1 \cap \mathcal{K}_2. \tag{6}$$

Recently, Lee et al. (2015, Theorem 49) gave a fast algorithm to solve problems of the kind above, given access to oracles that solve linear optimization problems over $\mathcal{K}_1$ and $\mathcal{K}_2$.⁶ The algorithm makes $\tilde{O}(d)$ calls to these oracles, and takes an additional $\tilde{O}(\mathrm{poly}(d))$ running time.⁷ A linear optimization problem over $\mathcal{K}_1$ is equivalent to the AMO; the linear function defines the “rewards” that the AMO optimizes for.⁸ A linear optimization problem over $\mathcal{K}_2$ is trivial to solve. As an aside, a solution $Q \in C_0(\Pi)$ output by this algorithm has support equal to the policies output by the AMO during the run of the algorithm, and hence has size $\tilde{O}(d)$.

Using this, (OP) can be solved using $\tilde{O}(d\sqrt{KT\ln(|\Pi|)})$ calls to the AMO at the end of every epoch, and (5) can be solved using $\tilde{O}(d)$ calls, giving a total of $\tilde{O}(d\sqrt{KT\ln(|\Pi|)})$ calls to AMO. The complete algorithm to solve (OP) is in Appendix C.


4 Regret Analysis 


This section provides an outline of the proof of Theorem 1, which provides a bound on the regret of Algorithm 1. (A complete proof is given in Appendix D.) The proof structure is similar to the proof of Agarwal et al. (2014, Theorem 2), with major differences coming from the changes necessary to deal with mixed policies and constraint violations. We defined the algorithm to minimize $\widehat{\mathrm{Reg}}$ (through the first constraint in the optimization problem (OP)), and the first step is to show that this implies a bound on $\mathrm{Reg}$ as well. The alternate definitions of $\mathrm{Reg}$ and $\widehat{\mathrm{Reg}}$ require a different analysis than what was in Agarwal et al. (2014), and this difference is highlighted in the proof outline of Lemma 5 below. Once we have a bound on $\mathrm{Reg}$, we show that this implies a bound on the actual reward $R$, as well as the probability of violating the knapsack constraints.

⁶ Alternately, one could use the algorithms of Vaidya (1989a,b) to solve the same problem, with a slightly weaker polynomial running time.

⁷ Here, $\tilde{O}$ hides terms of the order [...], where $\epsilon$ is the accuracy needed of the solution.

⁸ These rewards may not lie in $[0,1]$, but an affine transformation of the rewards can bring them into $[0,1]$ without changing the solution.

We start by proving that the empirical average reward $\hat{R}_t(P)$ and consumption vector $\hat{\mathbf{V}}_t(P)$ for any mixed policy $P$ are close to the true averages $R(P)$ and $\mathbf{V}(P)$ respectively. We define $m_0$ such that for initial epochs $m < m_0$, $\mu_m = \frac{1}{2K}$. Recall that $\mu_m$ is the minimum probability of playing any action in epoch $m + 1$, defined in Step 1 of Algorithm 1. Therefore, for these initial epochs the variance of importance sampling estimates is small, and we can obtain a stronger bound on estimation error. For subsequent epochs, $\mu_m$ decreases, and we get error bounds in terms of the max variance of the estimates for policy $P$ across all epochs before time $t$, defined as $\mathcal{V}_t(P)$. In fact, the second constraint in the optimization problem (OP) seeks to bound this variance.

The precise definitions of the above-mentioned quantities are provided in Appendix D.


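Lemma 4. With probability $1 - \delta/2$, for all policies $P \in C(\Pi)$,

$$\max\{|\hat{R}_t(P) - R(P)|,\ \|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\|_\infty\} \le \begin{cases} \sqrt{\dfrac{8Kd_t}{t}}, & t \in \text{epoch } m_0,\ t \ge t_0, \\[2ex] \mathcal{V}_t(P)\mu_{m-1} + \dfrac{d_t}{t\mu_{m-1}}, & t \in \text{epoch } m,\ m > m_0. \end{cases}$$

Here, $d_t = \ln(16t^2|\Pi|(d+1)/\delta)$, $t_0 := \min\{t \in \mathbb{N} : \frac{d_t}{t} \le \frac{1}{4K}\}$, $m_0 := \min\{m \in \mathbb{N} : \frac{d_{\tau_m}}{\tau_m} \le \frac{1}{4K}\}$.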

Now suppose the error bounds in the above lemma hold. A major step is to show that, for every $P \in C(\Pi)$, the empirical regret $\widehat{\mathrm{Reg}}_t(P)$ and the actual regret $\mathrm{Reg}(P)$ are close in a particular sense.

Lemma 5. Assume that the events in Lemma 4 hold. Then, for all epochs $m \ge m_0$, all rounds $t \ge t_0$ in epoch $m$, and all policies $P \in C(\Pi)$,

$$\mathrm{Reg}(P) \le 2\widehat{\mathrm{Reg}}_t(P) + c_0K\mu_m, \quad \text{and} \quad \widehat{\mathrm{Reg}}_t(P) \le 2\mathrm{Reg}(P) + c_0K\mu_m,$$

for $\mathrm{Reg}(P)$, $\widehat{\mathrm{Reg}}_t(P)$ as defined in Section 3, and $c_0$ being a constant smaller than 150.

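Proof Outline. The proof of the above lemma is by induction, using the second constraint in (OP) to bound the variance $\mathcal{V}_t(P)$. Below, we prove the base case. This proof demonstrates the importance of appropriately choosing $Z$. Consider $m = m_0$, and $t \ge t_0$ in epoch $m$. For all $P \in C(\Pi)$,

$$(Z+1)(\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)) = \hat{R}_t(P_t) - \hat{R}_t(P) - R(P') + R(P) - Z\left[\phi(\hat{\mathbf{V}}_t(P_t), B') - \phi(\hat{\mathbf{V}}_t(P), B') + \phi(\mathbf{V}(P), B')\right]. \tag{7}$$

We can assume that $B \ge c'\sqrt{KT\ln(dT|\Pi|/\delta)}$ for any constant $c'$ (otherwise the regret guarantees in Theorem 1 are meaningless). Then, we have that $B \ge 2T_0 + 2c\sqrt{KT\ln(T|\Pi|/\delta)} = 2(B - B')$, implying $B' \ge \frac{B}{2}$. Also, observe that since $B \ge B'$, $\mathrm{OPT}(B) \ge \mathrm{OPT}(B')$. Then, by Lemma 2 and the choice of $Z$ as specified by Lemma 3, we have that for any $\gamma \ge 0$,

$$\mathrm{OPT}(B' + \gamma) \le \mathrm{OPT}(B') + \frac{Z}{2}\gamma. \tag{8}$$

Now, since $P'$ is defined as the optimal policy for budget $B'$, we obtain that $T R(P') = \mathrm{OPT}(B')$. Also, by definition of $\phi(\mathbf{V}(P_t), B')$, the policy $P_t$ is feasible for budget $B' + T\phi(\mathbf{V}(P_t), B')$, so $T R(P_t) \le \mathrm{OPT}(B' + T\phi(\mathbf{V}(P_t), B'))$, and therefore, applying (8) and dividing by $T$,

$$R(P') \ge R(P_t) - \frac{Z}{2}\phi(\mathbf{V}(P_t), B') \ge R(P_t) - Z\phi(\mathbf{V}(P_t), B').$$

Substituting in (7), we can upper bound $(Z+1)(\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P))$ by

$$\hat{R}_t(P_t) - \hat{R}_t(P) - R(P_t) + Z\phi(\mathbf{V}(P_t), B') + R(P) - Z\left[\phi(\hat{\mathbf{V}}_t(P_t), B') - \phi(\hat{\mathbf{V}}_t(P), B') + \phi(\mathbf{V}(P), B')\right]$$

$$\le |\hat{R}_t(P_t) - R(P_t)| + |\hat{R}_t(P) - R(P)| + Z\|\hat{\mathbf{V}}_t(P_t) - \mathbf{V}(P_t)\|_\infty + Z\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\|_\infty,$$

where the last step uses that $\phi(\cdot, B')$ is 1-Lipschitz with respect to the $\ell_\infty$ norm.

For the other side, by definition of $P_t$, we have that $\hat{R}_t(P_t) - Z\phi(\hat{\mathbf{V}}_t(P_t), B') \ge \hat{R}_t(\tilde{P}) - Z\phi(\hat{\mathbf{V}}_t(\tilde{P}), B')$ for every $\tilde{P} \in C(\Pi)$, in particular for $\tilde{P} = P'$. Substituting in (7) as above, and using that $\phi(\mathbf{V}(P'), B') = 0$, we get a similar upper bound on $(Z+1)(\mathrm{Reg}(P) - \widehat{\mathrm{Reg}}_t(P))$. Now substituting bounds from Lemma 4, we obtain

$$|\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)| \le 4\sqrt{\frac{2Kd_t}{t}} \le c_0K\mu_m.$$

This completes the base case. The remaining proof is by induction, using the bounds provided by Lemma 4 for epochs $m > m_0$ in terms of variance $\mathcal{V}_t(\cdot)$, and the bound on variance provided by the second constraint in (OP). The second constraint in (OP) provides a bound on the variance of any policy $P$ in any past epoch, in terms of $\widehat{\mathrm{Reg}}_\tau(P)$ for $\tau$ in that epoch; the inductive hypothesis is used in the proof to obtain those bounds in terms of $\mathrm{Reg}(P)$. □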

Given the above lemma, the first constraint in (OP), which bounds the estimated regret $\widehat{\mathrm{Reg}}_t(Q)$ for the chosen mixed policy $Q$, directly implies an upper bound on $\mathrm{Reg}(Q)$ for this mixed policy. Specifically, we get that for every epoch $m$, for the mixed policy $Q_m$ that solves (OP),

\[ \mathrm{Reg}(Q_m) \le (c_0 + 2)K\psi\mu_m. \]

Next, we bound the regret in epoch $m$ using the above bound on $\mathrm{Reg}(Q_{m-1})$. For simplicity of discussion, here we outline the steps for bounding regret for rewards sampled from policy $Q_{m-1}$ in epoch $m$. Note that this is not precise in the following ways. First, $Q_{m-1} \in \mathcal{C}_0(\Pi)$ may not be in $\mathcal{C}(\Pi)$ and therefore may not be a proper distribution (the actual sampling process puts the remaining probability on the default policy $P_t$ to obtain $Q_t$ at time $t$ in epoch $m$). Second, the actual sampling process picks an action from a smoothed projection of $Q_t$. However, we ignore these technicalities here in order to get across the intuition behind the proof; these technicalities are dealt with rigorously in the complete proof provided in Appendix D.

The first step is to use the above bound on $\mathrm{Reg}(Q_{m-1})$ to show that the expected reward $R(Q_{m-1})$ in epoch $m$ is close to the optimal reward $R(P^*)$. Since $\phi(\cdot, B')$ is always non-negative, by definition of $\mathrm{Reg}(Q)$, for any $Q$

[Z + l)Reg(Q) > R{P') - R{Q) > R[P*) - R[Q) - ^ 

where we used Lemma 2 to get the last inequality. If the algorithm never aborted due to constraint violation in Step 10, the above observation would bound the regret of the algorithm by

OPT 

^(i?(P*) - R{Qm-l)){Tm - Tm-l) < + l)(co + " T^-l) P—{B- B'). 

m m 

Then, using that $Z \le O(\frac{\mathrm{OPT}}{B} + 1)$, $B - B' = O(\sqrt{KT\ln(dT|\Pi|/\delta)})$, and properly chosen scaling factors ($\psi$ and $\mu_m$) results in the desired bound of $O\big((\frac{\mathrm{OPT}}{B} + 1)\sqrt{KT\ln(dT|\Pi|/\delta)}\big)$ for expected regret. An application of the Azuma-Hoeffding inequality obtains the high probability regret bound as stated in Theorem 1.

To complete the proof, we show that in fact, with probability $1 - \delta/2$, the algorithm is not aborted in Step 10 due to constraint violation. This involves showing that with high probability, the algorithm's consumption (in steps $t = 1, \ldots, T$) above $B'$ is bounded above by $c\sqrt{KT\ln(|\Pi|/\delta)}$, and since $B' + c\sqrt{KT\ln(|\Pi|/\delta)} + T_0 = B$, we obtain that the algorithm will satisfy the knapsack constraint with high probability. This also explains why we started with a smaller budget. More precisely, we show that for every $m$,

\[ \phi(V(Q_m), B') \le 4(c_0 + 2)K\psi\mu_m. \tag{9} \]

Recall that $\phi(V(P), B')$ was defined as the maximum violation of budget by vector $V(P)$. To prove the above, we observe that due to our choice of $Z$, $\phi(V(P), B')$ is bounded by $\mathrm{Reg}(P)$ as follows. By Equation (8), for all $P \in \mathcal{C}(\Pi)$, $R(P') \ge R(P) - \frac{Z}{2}\phi(V(P), B')$, so that

\[ (Z+1)\,\mathrm{Reg}(P) = R(P') - R(P) + Z\phi(V(P), B') \ge \frac{Z}{2}\phi(V(P), B'). \]

Then, using the bound $\mathrm{Reg}(Q_m) \le (c_0 + 2)K\psi\mu_m$, we obtain the bound in Equation (9). Summing this bound over all epochs $m$, and using Jensen's inequality and convexity of $\phi(\cdot, B')$, we obtain a bound on the max violation of the budget constraint by the algorithm's expected consumption vector $\frac{1}{T}\sum_m V(Q_{m-1})(\tau_m - \tau_{m-1})$. This is converted to a high probability bound using the Azuma-Hoeffding inequality.
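The role of the violation term in the regret definition can be illustrated with a small numeric sketch. It assumes the form $\phi(v, B') = \max_j (v_j - B'/T)^+$, which is consistent with the description of $\phi$ as the maximum budget violation; the function names and the toy numbers are hypothetical and not taken from the paper.

```python
def phi(v, budget, horizon):
    """Maximum per-round budget violation of consumption vector v.

    Assumed form: max_j (v_j - B'/T)^+, so it is zero when v fits the budget.
    """
    per_round = budget / horizon
    return max(0.0, max(v_j - per_round for v_j in v))


def reg(r_best, r_p, v_p, z, budget, horizon):
    """Reg(P) = (R(P') - R(P) + Z * phi(V(P), B')) / (Z + 1)."""
    return (r_best - r_p + z * phi(v_p, budget, horizon)) / (z + 1.0)


# Toy numbers (all hypothetical): T = 100 rounds, B' = 30, d = 2 resources.
feasible = phi([0.2, 0.3], budget=30, horizon=100)    # within budget
violating = phi([0.2, 0.5], budget=30, horizon=100)   # exceeds 0.3 by 0.2
toy_reg = reg(r_best=0.5, r_p=0.6, v_p=[0.2, 0.5], z=2.0, budget=30, horizon=100)
```

In this toy instance the policy earns more reward than $P'$ but overspends, and the penalty $Z\phi$ still makes its regret positive, which is the mechanism the argument above relies on.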
5 The CBwR problem


In this section, we consider a version of the problem with a concave objective function, and show how to get an efficient algorithm for it. The CBwR problem is identical to the CBwK problem, except for the following. The outcome in a round is simply the vector $v$, and the goal of the algorithm is to maximize $f\big(\frac{1}{T}\sum_{t=1}^{T} v_t(a_t)\big)$ for some concave function $f$ defined on the domain $[0,1]^d$, and given to the algorithm ahead of time. The optimum mixed policy is now defined as

\[ P^* = \arg\max_{P \in \mathcal{C}(\Pi)} f(V(P)). \tag{10} \]

The optimum value is $\mathrm{OPT} = f(V(P^*))$ and we bound the average regret, which is

\[ \text{avg-regret} := \mathrm{OPT} - f\Big(\frac{1}{T}\sum_{t=1}^{T} v_t(a_t)\Big). \]

The main result of this section is an $O(1/\sqrt{T})$ regret bound for this problem. Note that the regret scales as $1/\sqrt{T}$ rather than $\sqrt{T}$ since the problem is defined in terms of the average of the vectors rather than the sum. We assume that $f$ is represented in such a way that we can solve optimization problems of the following form in polynomial time.^1 For any given $a \in \mathbb{R}^d$,

\[ \max f(x) + a \cdot x : x \in [0,1]^d. \]

Theorem 6. For the CBwR problem, if $f$ is $L$-Lipschitz w.r.t. norm $\|\cdot\|$, then there is a polynomial time algorithm that makes $\tilde O(d\sqrt{KT\ln(|\Pi|)})$ calls to AMO, and with probability at least $1 - \delta$ has regret

avg-regret{T) = O [sjK ln(r|n|/^) + . 

Remark. A special case of this problem is when there are only constraints, in which case $f$ could be defined as the negative of the distance from the constraint set. Further, one could handle both a concave objective function and convex constraints as follows. Suppose that we wish to maximize $h\big(\frac{1}{T}\sum_{t=1}^{T} v_t(a_t)\big)$, subject to the constraint that $\frac{1}{T}\sum_{t=1}^{T} v_t(a_t) \in S$, for some $L$-Lipschitz concave function $h$ and a convex set $S$. Further, suppose that we had a good estimate of the optimum achieved by a static mixed policy, i.e.,

\[ \mathrm{OPT}' := \max_{P \in \mathcal{C}(\Pi)} h(V(P)) \quad \text{s.t. } V(P) \in S. \tag{11} \]

For some distance function $d(\cdot, S)$ measuring the distance of a point from the set $S$, define

\[ f(v) := \min\{h(v) - \mathrm{OPT}', \; -L\,d(v, S)\}. \]
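The combined objective from the remark can be sketched on a toy instance. Everything concrete below is an assumption chosen for illustration: $d = 2$, $h(v) = v_1$ (which is 1-Lipschitz in the sup norm), $S = \{v : v_2 \le 0.5\}$, sup-norm distance to $S$, and $\mathrm{OPT}' = 0.8$.

```python
def make_f(h, opt_prime, lipschitz, dist_to_s):
    """Build f(v) = min{h(v) - OPT', -L * d(v, S)} from its ingredients."""
    def f(v):
        return min(h(v) - opt_prime, -lipschitz * dist_to_s(v))
    return f


# Hypothetical instance: reward is the first coordinate, the second
# coordinate must stay below 0.5.
h = lambda v: v[0]
dist = lambda v: max(0.0, v[1] - 0.5)   # sup-norm distance to S
f = make_f(h, opt_prime=0.8, lipschitz=1.0, dist_to_s=dist)

at_optimum = f([0.8, 0.5])      # feasible and h = OPT'
infeasible = f([0.9, 0.7])      # better h, but 0.2 outside S
suboptimal = f([0.6, 0.2])      # feasible, but h falls short by 0.2
```

In this sketch $f$ is never positive on the sampled points, and it reaches zero only at a point that is feasible and attains $\mathrm{OPT}'$, so maximizing $f$ penalizes both a shortfall in $h$ and a violation of $S$.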


5.1 Algorithm 

Since we do not have any hard constraints and do not need to estimate $Z$ as in the case of CBwK, we can drop Steps 2-5 and Step 10 in Algorithm 1, and set $T_0 = 0$. The optimization problem (OP) is also the same, but with new definitions of $\mathrm{Reg}(P)$, $P_t$ and $\widehat{\mathrm{Reg}}_t(P)$ as below. Recall that $P^*$ is the optimal policy as given by Equation (10), and $L$ is the Lipschitz factor for $f$ with respect to norm $\|\cdot\|$. We now define the regret of policy $P \in \mathcal{C}(\Pi)$ as

Reg(P) := (/(V(P*)) - /(V(P))). 

The best empirical policy is now given by 


\[ P_t := \arg\max_{P \in \mathcal{C}(\Pi)} f(\hat V_t(P)), \tag{12} \]

and an estimate of the regret of policy $P \in \mathcal{C}(\Pi)$ at time $t$ is

^MP) ■■= p^(/(Vt(Pt)) - /(Vt(p))). 

^1 This problem has nothing to do with contexts and policies, and only depends on the function $f$.
Another difference is that we need to solve a convex optimization problem to find $P_t$ (as defined in (12)) once every epoch. A similar convex optimization problem needs to be solved in every iteration of a coordinate descent algorithm for solving (OP) (details of this are in Appendix C.2). In both cases, the problems can be cast in the form

\[ \min g(x) : x \in C, \]

where $g$ is a convex function, $C$ is a convex set, and we are given access to a linear optimization oracle that solves a problem of the form $\min c \cdot x : x \in C$. In (12), for instance, $C$ is the set of all $\hat V_t(P)$ for all $P \in \mathcal{C}(\Pi)$. A linear optimization oracle over this $C$ is just an AMO as in Definition 1. We show how to efficiently solve such a convex optimization problem using cutting plane methods [Vaidya, 1989a, Lee et al., 2015], while making only $\tilde O(d)$ calls to the oracle. The details of this are in Appendix C.2.


5.2 Regret Analysis: Proof of Theorem 6

We prove that Algorithm 1 and (OP) with the above new definition of $\widehat{\mathrm{Reg}}_t(P)$ achieve the regret bounds of Theorem 6 for the CBwR problem. A complete proof of this theorem is given in Appendix E. Here, we sketch some key steps.

The first step of the proof is to use constraints in (OP) to prove a lemma akin to Lemma 5, showing that the empirical regret $\widehat{\mathrm{Reg}}_t(P)$ and actual regret $\mathrm{Reg}(P)$ are close for every $P \in \mathcal{C}(\Pi)$. Therefore, the first constraint in (OP) that bounds the empirical regret $\widehat{\mathrm{Reg}}_t(Q_m)$ of the computed policy implies a bound on the actual regret
Reg((3m) = xpj(/(V(P*)) — f(V(Qm)))- Ignoring the technicalities of sampling process (which are dealt with 
in the complete proof), and assuming that $Q_{m-1}$ is the policy used in epoch $m$, this provides a bound on regret in every epoch. Regret across epochs can be combined using Jensen's inequality, which bounds the regret in expectation. Using the Azuma-Hoeffding inequality to bound the deviation of the expected reward vector from the actual reward vector, we obtain the high probability regret bound stated in Theorem 6.
Appendix


A Concentration Inequalities 


Lemma 7. (Freedman's inequality for martingales [Beygelzimer et al., 2011]) Let $X_1, X_2, \ldots, X_T$ be a sequence of real-valued random variables. Assume for all $t \in \{1, 2, \ldots, T\}$, $|X_t| \le R$ and $E[X_t | X_1, \ldots, X_{t-1}] = 0$. Define $S := \sum_{t=1}^{T} X_t$ and $V := \sum_{t=1}^{T} E[X_t^2 | X_1, \ldots, X_{t-1}]$. For any $\rho \in (0,1)$ and $\lambda \in [0, 1/R]$, with probability at least $1 - \rho$,

\[ S \le (e-2)\lambda V + \frac{1}{\lambda}\ln\frac{1}{\rho}. \]


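Lemma 8. (Multiplicative version of Chernoff bounds) Let $X_1, \ldots, X_n$ denote independent random samples from a distribution supported on $[a, b]$ and let $\mu := E[\sum_{i=1}^{n} X_i]$. Then, for all $\epsilon > 0$,

\[ \Pr\Big[\Big|\sum_{i=1}^{n} X_i - \mu\Big| \ge \epsilon\mu\Big] \le \exp\Big(-\frac{\mu\epsilon^2}{3(b-a)^2}\Big). \]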


Corollary 9. Let $X_1, \ldots, X_n$ denote independent random samples from a distribution supported on $[a, b]$ and let $\mu := E[\sum_{i=1}^{n} X_i]$. Then, for all $\rho > 0$, with probability at least $1 - \rho$,

\[ \Big|\sum_{i=1}^{n} X_i - \mu\Big| \le (b-a)\sqrt{3\mu\log(1/\rho)}. \]


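Proof. Given $\rho > 0$, use Lemma 8 with

\[ \epsilon = (b-a)\sqrt{\frac{3\log(1/\rho)}{\mu}} \]

to get that the probability of the event $|\sum_{i=1}^{n} X_i - \mu| > \epsilon\mu = (b-a)\sqrt{3\mu\log(1/\rho)}$ is at most

\[ \exp\Big(-\frac{\mu\epsilon^2}{3(b-a)^2}\Big) = \exp(-\log(1/\rho)) = \rho. \]

□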
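The substitution used in this proof can be checked numerically. The sketch below assumes the reading of Lemma 8 and Corollary 9 in which the tail bound is $\exp(-\mu\epsilon^2/(3(b-a)^2))$; the helper names and the values of $\mu$ and $\rho$ are hypothetical.

```python
import math


def chernoff_epsilon(mu, rho, a=0.0, b=1.0):
    """The epsilon chosen in the proof: (b - a) * sqrt(3 log(1/rho) / mu)."""
    return (b - a) * math.sqrt(3.0 * math.log(1.0 / rho) / mu)


def chernoff_tail(mu, eps, a=0.0, b=1.0):
    """Tail bound exp(-mu * eps^2 / (3 (b - a)^2)) from the lemma."""
    return math.exp(-mu * eps ** 2 / (3.0 * (b - a) ** 2))


mu, rho = 50.0, 0.05           # illustrative values only
eps = chernoff_epsilon(mu, rho)
tail = chernoff_tail(mu, eps)  # should reproduce rho
deviation = eps * mu           # the additive deviation in the corollary
```

With this choice of $\epsilon$ the tail bound collapses to $\rho$, and the additive deviation $\epsilon\mu$ equals $(b-a)\sqrt{3\mu\log(1/\rho)}$.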


B Setting Z (Proof of Lemma 2)


We use the first few rounds to do pure exploration, that is, $a_t$ is picked uniformly at random from the set of arms, and use the outcomes from these rounds to compute an estimate of OPT. Let

\[ \hat r_t(a) := r_t(a) \cdot I\{a = a_t\}, \qquad \hat v_t(a) := v_t(a) \cdot I\{a = a_t\}. \]

Note that $\hat r_t(a) \in [0,1]$, $\hat v_t(a) \in [0,1]^d$. Since $a_t$ is picked uniformly at random from the set of arms,

\[ E[\hat r_t(a) | H_{t-1}] = \frac{1}{K}E[r_t(a)], \quad \text{and} \quad E[\hat v_t(a) | H_{t-1}] = \frac{1}{K}E[v_t(a)]. \]
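The $1/K$ factor can be verified by exact enumeration over the uniformly drawn arm. This is a minimal sketch for fixed (non-random) rewards; the function name and the reward values are hypothetical.

```python
from fractions import Fraction


def expected_estimate(rewards):
    """E over a uniform a_t of r(a) * 1{a == a_t}, computed for every arm a."""
    k = len(rewards)
    return [
        sum(Fraction(1, k) * (r if a == a_t else 0) for a_t in range(k))
        for a, r in enumerate(rewards)
    ]


# Three arms with fixed rewards; exact rational arithmetic avoids rounding.
rewards = [Fraction(1, 2), Fraction(1, 4), Fraction(1)]
est = expected_estimate(rewards)
```

Each entry of `est` is the true reward divided by $K = 3$, so an unbiased estimate requires rescaling the indicator-weighted outcomes by $K$.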

For any policy $P \in \mathcal{C}_0(\Pi)$, let

\[ r(P) := E_{(x,r,v)\sim\mathcal{D},\,\pi\sim P}[r(\pi(x))], \]
\[ \hat r_t(P) := \frac{K}{t}\sum_{\tau \in [t]} E_{\pi\sim P}[\hat r_\tau(\pi(x_\tau))], \]
\[ v(P) := E_{(x,r,v)\sim\mathcal{D},\,\pi\sim P}[v(\pi(x))], \]
\[ \hat v_t(P) := \frac{K}{t}\sum_{\tau \in [t]} E_{\pi\sim P}[\hat v_\tau(\pi(x_\tau))] \]

be the actual and estimated means of reward and consumption for a given policy $P$, and let $|\mathrm{supp}(P)|$ denote the size of the support of $P$. Interpreting a policy $\pi \in \Pi$ as a (degenerate) distribution over policies in $\Pi$, we slightly abuse notation, defining $r(\pi)$, $\hat r_t(\pi)$, $v(\pi)$, and $\hat v_t(\pi)$ similarly. Observe that for any $P \in \mathcal{C}_0(\Pi)$,

\[ E[\hat r_t(P) | H_{t-1}] = r(P), \quad \text{and} \quad E[\hat v_t(P) | H_{t-1}] = v(P). \]

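Lemma 10. For all $\delta > 0$, let $\eta := \sqrt{3K\log((d+1)|\Pi|/\delta)}$. Then for any $t$, with probability $1 - \delta$, for all $P \in \mathcal{C}_0(\Pi)$,

\[ |\hat r_t(P) - r(P)| \le \eta\sqrt{r(P)/t}, \]
\[ \forall j, \quad |\hat v_t(P)_j - v(P)_j| \le \eta\sqrt{v(P)_j/t}. \]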

Proof. We will first show that the first inequality holds with probability $1 - \delta/(d+1)$. The same analysis can be applied to each of the $d$ dimensions of the consumption vector. The lemma follows by a direct use of the union bound.

Fix a policy $\pi \in \Pi$. Consider the random variables $X_\tau = \hat r_\tau(\pi(x_\tau))$, for $\tau \in [t]$. Note that $X_\tau \in [0,1]$, $E[X_\tau] = \frac{1}{K}r(\pi)$, and $\frac{1}{t}\sum_{\tau=1}^{t} X_\tau = \frac{1}{K}\hat r_t(\pi)$. Applying Corollary 9 to these variables, we get that with probability $1 - \delta/((d+1)|\Pi|)$,

\[ \Big|\frac{1}{K}\hat r_t(\pi) - \frac{1}{K}r(\pi)\Big| \le \sqrt{3\log((d+1)|\Pi|/\delta)}\,\sqrt{r(\pi)/(Kt)}. \]

Equivalently,

\[ |\hat r_t(\pi) - r(\pi)| \le \eta\sqrt{r(\pi)/t}. \tag{13} \]

Applying a union bound over all $\pi \in \Pi$, we have, with probability $1 - \delta/(d+1)$, that Equation (13) holds for all $\pi \in \Pi$. In the rest of the proof, we assume Equation (13) holds.

Now consider a policy $P \in \mathcal{C}_0(\Pi)$.

\[ |\hat r_t(P) - r(P)| \le E_{\pi\sim P}[|\hat r_t(\pi) - r(\pi)|] \]
\[ \le E_{\pi\sim P}\big[\eta\sqrt{r(\pi)/t}\big] \]
\[ \le \eta\sqrt{E_{\pi\sim P}[r(\pi)]/t} \]
\[ = \eta\sqrt{r(P)/t}. \]

The inequality in the third line follows from the concavity of the square root function.

□


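We solve a relaxed optimization problem on the sample to compute our estimate. Define $\widehat{\mathrm{OPT}}^{\gamma}$ as the value of the optimal mixed policy in $\mathcal{C}_0(\Pi)$ on the empirical distribution up to time $t$, when the budget constraints are relaxed by $\gamma$:

\[ \widehat{\mathrm{OPT}}^{\gamma} := \max_{P \in \mathcal{C}_0(\Pi)} T\hat r_t(P) \quad \text{s.t. } T\hat v_t(P) \le (B + \gamma)\mathbf{1}. \tag{14} \]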


Let $P_t \in \mathcal{C}_0(\Pi)$ be the policy that achieves this maximum in (14). Let (as earlier) $P^*$ denote the optimal policy w.r.t. $\mathcal{D}$, i.e., the policy that achieves the maximum in the definition of OPT.

Lemma 3 is now an immediate consequence of the following lemma, for $\gamma$ and $t$ as in the lemma, by setting

\[ Z = \max\Big\{\frac{8\,\widehat{\mathrm{OPT}}^{\gamma}}{B}, 1\Big\}. \]
Lemma 11. Suppose that for the first $t := 12K\ln\big(\frac{(d+1)|\Pi|}{\delta}\big)\frac{T}{B}$ rounds the algorithm does pure exploration, pulling each arm with equal probability, and let $\gamma := \frac{B}{2}$. Then with probability at least $1 - \delta$,

\[ \mathrm{OPT} \le \max\{2\,\widehat{\mathrm{OPT}}^{\gamma}, B\} \le 2B + 6\,\mathrm{OPT}. \]

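Proof. Let $\eta = \sqrt{3K\log((d+1)|\Pi|/\delta)}$ be as in Lemma 10. Observe that then $\eta/\sqrt{t} = \sqrt{B/4T}$ and $\eta\sqrt{BT/t} = \gamma$. By Lemma 10, with probability $1 - \delta$, we have that

\[ \hat v_t(P^*) \le \frac{B + \gamma}{T}\mathbf{1}, \]

and therefore $P^*$ is a feasible solution to the optimization problem (14), and hence $\widehat{\mathrm{OPT}}^{\gamma} \ge T\hat r_t(P^*)$. Again from Lemma 10,

\[ T\hat r_t(P^*) \ge \mathrm{OPT} - \eta\sqrt{T\,\mathrm{OPT}/t} = \mathrm{OPT} - \sqrt{\mathrm{OPT}\cdot B}/2. \]

Now either $B \ge \mathrm{OPT}$ or otherwise

\[ \mathrm{OPT} - \sqrt{\mathrm{OPT}\cdot B}/2 \ge \mathrm{OPT}/2. \]

In either case, the first inequality in the lemma holds.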

On the other hand, again from Lemma 10,

\[ \forall j, \quad v(P_t)_j - \eta\sqrt{v(P_t)_j/t} \le \hat v_t(P_t)_j \]
\[ \le (B + \gamma)/T \]
\[ = 3B/2T \]
\[ = 9B/4T - \eta\sqrt{9B/4Tt}. \]

The second inequality holds since $P_t$ is a feasible solution to (14). The function $f(x) = x - \sqrt{cx}$ is increasing in the interval $[c/4, \infty)$ and therefore $v(P_t)_j \le 9B/4T$, and $P_t$ is a feasible solution to the optimization problem (1), with budgets multiplied by 9/4. This increases the optimum value of (1) by at most a factor of 9/4 and hence $Tr(P_t) \le 9\,\mathrm{OPT}/4$.
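The monotonicity claim and the identity in the last line of the display can be checked directly. The following worked step is a sketch that takes $c = \eta^2/t = B/4T$, which is what the relation $\eta/\sqrt{t} = \sqrt{B/4T}$ from the start of the proof gives:

\[ f'(x) = 1 - \frac{\sqrt{c}}{2\sqrt{x}} \ge 0 \iff \sqrt{x} \ge \frac{\sqrt{c}}{2} \iff x \ge \frac{c}{4}, \]
\[ f\Big(\frac{9B}{4T}\Big) = \frac{9B}{4T} - \sqrt{\frac{B}{4T}\cdot\frac{9B}{4T}} = \frac{9B}{4T} - \frac{3B}{4T} = \frac{3B}{2T}. \]

So $f(v(P_t)_j) \le f(9B/4T)$ with $9B/4T \ge c/4$, and monotonicity on $[c/4, \infty)$ gives $v(P_t)_j \le 9B/4T$ (values below $c/4$ satisfy the bound trivially).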

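Also from Lemma 10,

\[ \widehat{\mathrm{OPT}}^{\gamma} = T\hat r_t(P_t) \le Tr(P_t) + \eta T\sqrt{r(P_t)/t} \]
\[ \le 9\,\mathrm{OPT}/4 + \sqrt{9\,\mathrm{OPT}\cdot B/16}. \]

Once again, if $\mathrm{OPT} \ge B$, we get from the above that $\widehat{\mathrm{OPT}}^{\gamma} \le 3\,\mathrm{OPT}$. Otherwise, we get that $\widehat{\mathrm{OPT}}^{\gamma} \le 9\,\mathrm{OPT}/4 + 3B/4$. In either case, the second inequality of the lemma holds.

□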

C Implementation details: Solving Optimization Problem (OP) by Coordinate Descent

At the end of every epoch $m$ of Algorithm 1, we solve an optimization problem (OP) to find $Q_m \in \mathcal{C}_0(\Pi)$. The same optimization problem is used for both CBwK and CBwR, although with different definitions of $\widehat{\mathrm{Reg}}_t(\cdot)$. In this section, we show how to solve the optimization problem (OP) using a coordinate descent algorithm along with AMO, for both CBwK and CBwR.

In this optimization problem (OP), described in Section 3, $Q \in \mathcal{C}_0(\Pi)$ was expressed as $\alpha Q'$ for some $Q' \in \mathcal{C}(\Pi)$. It is easy to see that any $Q \in \mathcal{C}_0(\Pi)$ can also be expressed as a linear combination of multiple mixed policies in $\mathcal{C}(\Pi)$:

\[ Q = \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q)\,P, \]

for some constants $\{\alpha_P(Q)\}_{P \in \mathcal{C}(\Pi)}$, so that

\[ \forall P \in \mathcal{C}(\Pi) : \alpha_P(Q) \ge 0 \quad \text{and} \quad \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q) \le 1. \]

Note that the coefficients $\{\alpha_P(Q)\}$ may not be unique. Now, consider the following variant of (OP):


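Optimization Problem (OP')

Given: $H_t$, $\mu_m$, and $\psi$.

Let $b_P := \frac{\widehat{\mathrm{Reg}}_t(P)}{\psi\mu_m}$, $\forall P \in \mathcal{C}(\Pi)$, where $\psi := 100$.

Find $Q = \big(\sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q)P\big) \in \mathcal{C}_0(\Pi)$, such that

\[ \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q)\,b_P \le 2K, \]
\[ \forall P \in \mathcal{C}(\Pi) : \quad E_{\pi\sim P}E_{x\sim H_t}\Big[\frac{1}{Q^{\mu_m}(\pi(x)|x)}\Big] \le b_P + 2K. \]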


Lemma 12. The two optimization problems, (OP) and (OP'), are equivalent.

Proof. It suffices to prove that any feasible solution to one problem provides a feasible solution to the other. To see this, first note that any solution $Q \in \mathcal{C}_0(\Pi)$ to (OP) is trivially a solution to (OP').

For the other direction, suppose we are given a solution $Q \in \mathcal{C}_0(\Pi)$ to (OP'). Set $Q' = \alpha^{-1}\sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q)P$ with $\alpha = \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q)$; clearly, $Q' \in \mathcal{C}(\Pi)$. Then, by Jensen's inequality, as well as the second condition of (OP'), we have

\[ \alpha\,\widehat{\mathrm{Reg}}_t(Q') \le \alpha\sum_{P \in \mathcal{C}(\Pi)} \frac{\alpha_P(Q)}{\sum_{P' \in \mathcal{C}(\Pi)} \alpha_{P'}(Q)}\widehat{\mathrm{Reg}}_t(P) \]
\[ = \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q)\,\widehat{\mathrm{Reg}}_t(P) \]
\[ = \psi\mu_m\sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q)\,b_P \]
\[ \le 2K\psi\mu_m. \]

Thus, the first constraint of (OP) is satisfied. Also, since $\alpha Q' = Q$, the second constraint of (OP) is trivially satisfied. Therefore, $\alpha Q'$ is a feasible solution to (OP). □

In the rest, we show how to solve (OP') using a coordinate descent algorithm, which assigns a non-zero weight $\alpha_P(Q)$ to at most one new policy $P \in \mathcal{C}(\Pi)$ in every iteration.

Let us fix $m$ and use shorthand $\mu$ for $\mu_m$. Problem (OP') is of the same form as the optimization problem in Agarwal et al. [2014], except that the policy set being considered is $\mathcal{C}(\Pi)$ instead of $\Pi$. We can solve it using Algorithm 2, a coordinate descent algorithm similar to Agarwal et al. [2014, Algorithm 2].

The lemma below bounds the number of iterations in this algorithm.

Lemma 13. The number of times Step 8 of the algorithm is performed is bounded by $4\ln(1/(K\mu))/\mu$.

Proof. This follows by applying the analysis of Algorithm 2 in Agarwal et al. [2014] (refer to Section 5) with the policy set being $\mathcal{C}(\Pi)$ instead of $\Pi$. (Their analysis holds for any value of the constant $\mu$, and constants $b_P$ for policies in the policy set being considered.) □
Algorithm 2 Coordinate Descent Algorithm for Solving (OP)

Input: History $H_t$, minimum probability $\mu > 0$, initial weights $Q_{init} \in \mathcal{C}_0(\Pi)$.
1: $Q \leftarrow Q_{init}$.
2: loop
3:   Define, for all $P \in \mathcal{C}(\Pi)$,
       $V_P(Q) := E_{\pi\sim P}E_{x\sim H_t}\big[\frac{1}{Q^{\mu}(\pi(x)|x)}\big]$,
       $S_P(Q) := E_{\pi\sim P}E_{x\sim H_t}\big[\frac{1}{(Q^{\mu}(\pi(x)|x))^2}\big]$,
       $D_P(Q) := V_P(Q) - (2K + b_P)$.
4:   if $\sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q)(2K + b_P) > 2K$ then
5:     Replace $Q$ by $cQ$ so that $Q \in \mathcal{C}(\Pi)$, where
         $c := \frac{2K}{\sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q)(2K + b_P)} < 1$.   (15)
6:   end if
7:   if there is a $P \in \mathcal{C}(\Pi)$ for which $D_P(Q) > 0$ then
8:     Update the coefficient for $P$ by
         $\alpha_P(Q) \leftarrow \alpha_P(Q) + \frac{V_P(Q) + D_P(Q)}{2(1 - K\mu)S_P(Q)}$.
9:   else
10:    Halt and output the current set of weights $Q$.
11:  end if
12: end loop


Now, since in epoch $m$, $\mu = \mu_m \ge \sqrt{d_T/(KT)}$ with $d_T = \ln(16T^2|\Pi|(d+1)/\delta)$, this proves that the algorithm converges in at most $O\big(\sqrt{KT\ln(T|\Pi|/\delta)}\,\ln(T\ln(T|\Pi|))\big) = \tilde O\big(\sqrt{KT\ln(|\Pi|)}\big)$ iterations of the loop.

Next, we discuss how to implement each iteration of the loop. In every iteration in Step 8, we need to identify a $P$ for which $D_P(Q) > 0$, for which we need to access the policy space using AMO. Also, in the beginning before the loop is started, one needs to compute $P_t$ by solving an optimization problem over the policy space. Below, we provide an implementation of these optimization problems using AMO. Since $\widehat{\mathrm{Reg}}_t(P)$ is defined differently for CBwK and CBwR, the implementation details and the number of calls to AMO differ. But importantly, as we show in Lemma 15 and Lemma 16, in both cases each iteration of Algorithm 2 can be implemented using $\tilde O(d)$ AMO calls. Using these results with the above lemma, we obtain the following.

Lemma 14. For CBwK and CBwR, (OP) can be solved using $\tilde O(d\sqrt{KT\ln(|\Pi|)})$ calls to the AMO at the end of every epoch.

As an aside, a solution $Q \in \mathcal{C}_0(\Pi)$ output by this algorithm has support bounded by the number of calls to AMO during the run of the algorithm (AMO maximizes a linear function, and therefore always returns a pure policy). Therefore, the results in the subsections below also prove that the policies returned by this algorithm have small support, and can be compactly represented.
C.1 AMO-based Implementation for CBwK

Lemma 15. For CBwK, Algorithm 2 can be implemented using $\tilde O(d)$ calls to AMO (Definition 1) in the beginning before the loop is started, and $\tilde O(d)$ calls for each iteration of the loop thereafter.

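Proof. In the beginning before the loop is started, one needs to compute $P_t$, which means solving the following problem on the policy space:

\[ \arg\max_{P \in \mathcal{C}(\Pi)} \hat R_t(P) - Z\phi(\hat V_t(P), B'). \tag{16} \]

Using the definition of $\phi(\cdot, \cdot)$, observe that this is the same as

\[ \arg\max_{P \in \mathcal{C}(\Pi),\,\lambda \ge 0} \hat R_t(P) - Z\lambda \quad \text{s.t. } \hat V_t(P) \le \Big(\frac{B'}{T} + \lambda\Big)\mathbf{1}. \tag{17} \]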
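The equivalence between the penalty $\phi$ and the auxiliary variable $\lambda$ can be checked on a toy vector. The sketch assumes $\phi(v, B') = \max_j (v_j - B'/T)^+$ and compares it with the smallest $\lambda \ge 0$ satisfying $v \le (B'/T + \lambda)\mathbf{1}$, found by a grid search; the function names and numbers are hypothetical.

```python
def phi_closed_form(v, per_round_budget):
    """Assumed closed form: max_j (v_j - B'/T)^+."""
    return max(0.0, max(v_j - per_round_budget for v_j in v))


def phi_by_search(v, per_round_budget, grid=1001):
    """Smallest lambda on a grid over [0, 1] with v <= (B'/T + lambda) * 1."""
    for i in range(grid):
        lam = i / (grid - 1)
        if all(v_j <= per_round_budget + lam for v_j in v):
            return lam
    raise ValueError("no feasible lambda in [0, 1]")


# Values chosen to be exactly representable in binary floating point.
violating = [0.125, 0.75]
within = [0.125, 0.25]
```

Since the objective in (17) decreases in $\lambda$, the maximizer always picks the smallest feasible $\lambda$, which is exactly the violation $\phi$.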


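In every iteration of the loop, we need to identify a $P$ for which $D_P(Q) > 0$. (All the other steps of the algorithm can be performed efficiently for $Q$ with sparse support.) Now,

\[ D_P(Q) = V_P(Q) - (2K + b_P) \]
\[ = V_P(Q) - \Big(2K + \frac{\widehat{\mathrm{Reg}}_t(P)}{\psi\mu}\Big) \]
\[ = \Big(\frac{1}{t}\sum_{i=1}^{t}\sum_{\pi}\frac{P(\pi)}{Q^{\mu}(\pi(x_i)|x_i)}\Big) - \Big[2K + \frac{\hat R_t(P_t) - Z\phi(\hat V_t(P_t), B') - \hat R_t(P) + Z\phi(\hat V_t(P), B')}{\psi\mu(Z+1)}\Big]. \]

Finding $P$ such that $D_P(Q) > 0$ requires solving $\arg\max_{P \in \mathcal{C}(\Pi)} D_P(Q)$. Once again, using the definition of $\phi(\cdot, \cdot)$, this is equivalent to the following problem:

\[ \arg\max_{P \in \mathcal{C}(\Pi),\,\lambda \ge 0} \frac{1}{t}\sum_{i=1}^{t}\sum_{\pi}\frac{P(\pi)}{Q^{\mu}(\pi(x_i)|x_i)} + \frac{\hat R_t(P) - Z\lambda}{\psi\mu(Z+1)} \quad \text{s.t. } \hat V_t(P) \le \Big(\frac{B'}{T} + \lambda\Big)\mathbf{1}. \tag{18} \]

Both problems (17) and (18) are of the following form (for (18), after scaling the objective by $\psi\mu(Z+1)$):

\[ \max x - Z\lambda \quad \text{such that } (x, y, \lambda) \in K_1 \cap K_2, \tag{19} \]

where

\[ K_2 := \Big\{(x, y, \lambda) : y \le \Big(\frac{B'}{T} + \lambda\Big)\mathbf{1}\Big\} \cap [0,1]^{d+2} \]

and

\[ K_1 := \big\{(x, y, \lambda) : x = \hat R_t(P),\ y = \hat V_t(P) \text{ for some } P \in \mathcal{C}(\Pi),\ \lambda \in [0,1]\big\}, \text{ for (17)}, \]
\[ K_1 := \Big\{(x, y, \lambda) : x = \frac{\psi\mu(Z+1)}{t}\sum_{i=1}^{t}\sum_{\pi}\frac{P(\pi)}{Q^{\mu}(\pi(x_i)|x_i)} + \hat R_t(P),\ y = \hat V_t(P) \text{ for some } P \in \mathcal{C}(\Pi),\ \lambda \in [0,1]\Big\}, \text{ for (18)}. \]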


Recently, Lee et al. [2015, Theorem 49] gave a fast algorithm to solve problems of the form (19), given access to oracles that solve linear optimization problems over $K_1$ and $K_2$.^2 The algorithm makes $\tilde O(d)$ calls to these oracles, and takes an additional $\tilde O(d^3)$ running time.^3 A linear optimization problem over $K_1$ is equivalent to the AMO; the linear function defines the "rewards" that the AMO optimizes for.^4 A linear optimization problem over $K_2$ is trivial to solve.

Therefore, each of these problems can be solved using $\tilde O(d)$ calls to AMO. □


^2 Alternately, one could use the algorithms of Vaidya [1989a,b] to solve the same problem, with a slightly weaker polynomial running time.
^3 Here, $\tilde O$ hides terms of the order $\log^{O(1)}(d/\epsilon)$, where $\epsilon$ is the accuracy needed of the solution.
^4 These rewards may not lie in $[0,1]$ but an affine transformation of the rewards can bring them into $[0,1]$ without changing the solution.
C.2 AMO-based Implementation for CBwR

Lemma 16. For CBwR, Algorithm 2 can be implemented using $\tilde O(d)$ calls to AMO (Definition 1) in the beginning before the loop is started, and $\tilde O(d)$ calls for each iteration of the loop thereafter.

Proof In each iteration, we need to identify a P for which Dp{Q) > 0, where 


Dp{Q) 


Vp{Q)-{2K + bp) 
Vp{Q) - {2K 






1 


i—1 7T 


Ql^{TT{Xi)\Xi) 


- \2K + 


f(VtiP),S)-f(Vt{Pt).S) 


Finding $P$ such that $D_P(Q) > 0$ requires solving $\arg\max_{P \in \mathcal{C}(\Pi)} D_P(Q)$, which is essentially minimizing a convex function over a convex set

\[ C = \Big\{y \in [0,1]^{d+1} : y = \Big[\frac{1}{t}\sum_{i=1}^{t}\sum_{\pi}\frac{P(\pi)}{Q^{\mu}(\pi(x_i)|x_i)};\ \hat V_t(P)\Big],\ \exists P \in \mathcal{C}(\Pi)\Big\}, \]

using only a linear optimization oracle (AMO) over $C$. Similarly, the problem of finding $P_t$, i.e., solving

\[ \arg\max_{P \in \mathcal{C}(\Pi)} f(\hat V_t(P)), \tag{20} \]

using access to AMO, can also be formulated as minimizing a convex function over a convex set, using only a linear optimization oracle. In fact, below we show that given any convex function $g$, a convex set $C$, both over the domain $[0,1]^d$, and access to a linear optimization oracle over $C$ that solves a problem of the form $\min c \cdot x : x \in C$, the problem $\min g(x) : x \in C$ can be solved using $\tilde O(d)$ calls to the linear optimization oracle. This completes the proof. □


Lemma 17. Suppose that we are given a convex function $g$, a convex set $C$ with non-empty relative interior, both over the domain $[0,1]^d$, and access to a linear optimization oracle over $C$ that solves a problem of the form $\min c \cdot x : x \in C$. Then, the problem $\min g(x) : x \in C$ can be solved using $\tilde O(d)$ calls to the linear optimization oracle and an additional $\tilde O(d^3)$ running time.


The proof of this lemma uses tools from convex optimization. We show how to solve this convex optimization problem using cutting plane methods [Vaidya, 1989a, Lee et al., 2015]. We first show a simple variant of these cutting plane algorithms that can be used to solve a convex optimization problem such as the one above, given access to a separation oracle over the convex set $C$, and a subgradient oracle for the function $g$. Then we define a dual optimization problem of the given problem, and show that a separation oracle for the dual constraint set can be implemented using a linear optimization oracle over $C$; thus we can solve the dual problem using cutting plane methods. Finally, we show that once the dual problem is solved, for the primal problem, it is sufficient to optimize over the convex hull of the vectors in $C$ returned by the linear optimization oracle over $C$ during the run of the algorithm. Since the number of such vectors is only $\tilde O(d)$, this can then be done efficiently.

A separation oracle for a convex set $C$ is such that given a point $x$, it returns either

• that $x \in C$, or

• a separating hyperplane, given by $a$ and $b$ s.t. $a \cdot x > b$ but $a \cdot y \le b$ $\forall y \in C$.
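The two possible answers of a separation oracle can be made concrete on a set where separation is easy. The sketch below uses the Euclidean unit ball, which is only an illustrative stand-in (the set $C$ in this appendix has no such closed form); the function name is hypothetical.

```python
import math


def separate_from_unit_ball(x):
    """Separation oracle for C = {y : ||y||_2 <= 1}.

    Returns None when x is in C; otherwise returns (a, b) with
    a . x > b and a . y <= b for every y in C.
    """
    norm = math.sqrt(sum(x_i * x_i for x_i in x))
    if norm <= 1.0:
        return None
    a = [x_i / norm for x_i in x]   # unit vector pointing towards x
    return a, 1.0                   # a . y <= ||y|| <= 1 on the ball


inside = separate_from_unit_ball([0.3, 0.4])
a, b = separate_from_unit_ball([3.0, 4.0])
```

For the outside point the returned hyperplane gives $a \cdot x = \|x\| = 5 > 1 = b$, while every point of the ball satisfies $a \cdot y \le 1$ by the Cauchy-Schwarz inequality.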

Cutting plane methods solve a convex optimization problem of the form 'find $x \in C$, or return that $C$ is empty', given access to a separation oracle for $C$. We first outline how to use cutting plane methods to solve an optimization problem of the form '$\min g(x)$ s.t. $x \in C$', given access to a subgradient oracle for $g$ and a separation oracle for $C$. (One can use binary search on the optimum value to reduce it to a feasibility problem, but we show here how one can directly use a cutting plane algorithm.) Given any point $x$, we first run the separation oracle for $C$ with input $x$, and return a separating hyperplane if that is what the oracle returns. If the separation oracle for $C$ returns that $x \in C$, then we return a separating hyperplane of the following form, with $y$ as the variable:

\[ \nabla g(x) \cdot y \le \nabla g(x) \cdot x, \]

where $\nabla g$ is any subgradient of $g$ at $x$. This is a valid inequality for $y = x^* := \arg\min g(x) : x \in C$, and $x \ne x^*$, due to the convexity of $g$. If the set of inequalities we return during the run of the algorithm becomes infeasible, then it must include an inequality of this kind for some point $x$ with $\|x - x^*\| \le \epsilon$, where $\epsilon$ is the accuracy of the solution.

We do this until the cutting plane algorithm returns that the set is empty, at which point we find the point $\arg\min g(x) : x \in C$, and $x$ was queried during the run of the cutting plane algorithm. We return this as the optimum point.

The cutting plane algorithm outlined above cannot be applied directly to our problem since we do not have a separation oracle for $C$. It is well known that separation and (linear) optimization are polynomial time equivalent for convex sets, using the ellipsoid method [Grötschel et al., 1988]. Since we have a linear optimization oracle for $C$ we could use this reduction to get a separation oracle. We show a more efficient method here, by using this oracle to solve the dual optimization problem. Define the Fenchel conjugate of $g$ as

\[ g^*(\theta) := \max_x\{\theta \cdot x - g(x)\}, \]

and let the support function of the set $C$ be

\[ h_C(\theta) := \max_x\{\theta \cdot x : x \in C\}. \]


Lemma 18.

\[ -\min_\theta\, g^*(\theta) + h_C(\theta) \le \min g(x) : x \in C. \]

This holds with equality if $C$ has a non-empty relative interior. The former optimization problem is called the dual of the latter.

Proof. The proof follows from the fact that $h_C$ is the Fenchel conjugate of the indicator function of $C$ (which is 0 inside $C$ and $\infty$ outside). This is a special case of Theorem 13.1 in Rockafellar [2015]. □


A subgradient oracle for $g^*$ can be implemented if we can solve the unconstrained optimization problem, $\max \theta \cdot x - g(x)$. We assume that $g$ is represented in such a way that we can solve this in polynomial time. A subgradient for $h_C$ is simply the arg max in its definition, and this is essentially what the linear optimization oracle gives us.
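The statement that the arg max is a subgradient of the support function can be checked on a small example. The sketch replaces $C$ by the convex hull of a finite, hypothetical list of points, so that the linear optimization oracle reduces to a scan; the names `linear_oracle` and `support` are illustrative and stand in for the AMO.

```python
def dot(u, w):
    return sum(u_i * w_i for u_i, w_i in zip(u, w))


def linear_oracle(theta, points):
    """Return a point of `points` maximizing theta . x (stand-in for the AMO)."""
    return max(points, key=lambda x: dot(theta, x))


def support(theta, points):
    """h_C(theta) = max{theta . x : x in C}, with C the hull of `points`."""
    return dot(theta, linear_oracle(theta, points))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
theta = (1.0, -2.0)
x_star = linear_oracle(theta, points)   # candidate subgradient at theta
```

For any other direction $\theta'$, the subgradient inequality $h_C(\theta') \ge h_C(\theta) + x^* \cdot (\theta' - \theta)$ reduces to $h_C(\theta') \ge x^* \cdot \theta'$, which holds because $x^*$ is itself a point of $C$.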

We use the cutting plane algorithm of Lee et al. [2015] to solve the problem $\min g^*(\theta) + h_C(\theta)$, as outlined above. The algorithm runs in $\tilde O(d^3)$ time and makes $\tilde O(d)$ calls to the separation/subgradient oracle. Let $x_1, x_2, \ldots, x_N \in C$ denote the subgradients of $h_C$ returned during the run of this algorithm. Then this run of the algorithm would remain unchanged if $C$ were to be replaced with $\mathrm{Conv}(x_1, x_2, \ldots, x_N)$, the convex hull of $x_1, x_2, \ldots, x_N$. Therefore the optima of these two convex programs are close to each other, and by strong duality, so are the optima of their duals. Hence

\[ \min g(x) : x \in \mathrm{Conv}(x_1, x_2, \ldots, x_N) \]

is a good approximation to $\min g(x) : x \in C$ (the problem we originally set out to solve). Further, this convex program can be solved efficiently since $N = \tilde O(d)$.
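The final step, minimizing a convex function over the convex hull of a few returned points, can be sketched with any method that only needs linear optimization over the hull. The example below uses the Frank-Wolfe method, which is a swapped-in technique chosen for brevity and not the cutting plane algorithm of Lee et al. [2015]; the quadratic objective and the three vertices are hypothetical.

```python
def frank_wolfe(grad, points, steps=2000):
    """Minimize a smooth convex function over Conv(points).

    Each step calls a linear minimization oracle over the hull (a scan of
    the vertices) and moves towards its answer with step size 2 / (k + 2).
    """
    x = list(points[0])
    for k in range(steps):
        g = grad(x)
        s = min(points, key=lambda p: sum(g_i * p_i for g_i, p_i in zip(g, p)))
        step = 2.0 / (k + 2.0)
        x = [(1.0 - step) * x_i + step * s_i for x_i, s_i in zip(x, s)]
    return x


# Toy objective g(x) = ||x - target||^2 with the target inside the hull.
target = (0.25, 0.25)
g_val = lambda x: sum((x_i - c_i) ** 2 for x_i, c_i in zip(x, target))
g_grad = lambda x: [2.0 * (x_i - c_i) for x_i, c_i in zip(x, target)]
vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
x_hat = frank_wolfe(g_grad, vertices)
```

Because the iterate is always a convex combination of the vertices, the method never leaves the hull, and its cost per step is one scan over the $N$ points.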


D Regret Analysis for Section 4: CBwK


The regret analysis is structurally similar to that of Agarwal et al. [2014], but differs in many important details as we also need to consider budget constraints.
The following quantities, already defined in the main text, are repeated here for convenience:

\[ B' = B - T_0 - c\sqrt{KT\ln(T|\Pi|/\delta)}, \]
\[ \phi(v, B') = \max_{j=1,\ldots,d}\Big(v_j - \frac{B'}{T}\Big)^+, \]
\[ \mathrm{Reg}(P) = \frac{1}{Z+1}\big(R(P') - R(P) + Z\phi(V(P), B')\big), \]
\[ \widehat{\mathrm{Reg}}_t(P) = \frac{1}{Z+1}\Big[\hat R_t(P_t) - Z\phi(\hat V_t(P_t), B') - \big(\hat R_t(P) - Z\phi(\hat V_t(P), B')\big)\Big]. \]


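Fix the epoch schedule $0 = \tau_0 < \tau_1 < \tau_2 < \ldots$, such that $\tau_m < \tau_{m+1} \le 2\tau_m$ for $m \ge 1$. The following quantities are defined for convenience; for $m \ge 1$,

\[ d_t := \ln(16t^2|\Pi|(d+1)/\delta), \]
\[ m_0 := \min\Big\{m \in \mathbb{N} : \frac{d_{\tau_m}}{\tau_m} \le \frac{1}{4K}\Big\}, \]
\[ t_0 := \min\Big\{t \in \mathbb{N} : \frac{d_t}{t} \le \frac{1}{4K}\Big\}, \]
\[ \rho := \sup_{m \ge m_0}\sqrt{\frac{\tau_m}{\tau_{m-1}}}. \]

A few quick observations are in order. First, the quantity $\mu_m$, as defined in Algorithm 1, can be rewritten as $\mu_m = \min\{\frac{1}{2K}, \sqrt{\frac{d_{\tau_m}}{K\tau_m}}\}$; for $m \ge m_0$, $\mu_m = \sqrt{\frac{d_{\tau_m}}{K\tau_m}}$. Furthermore, $d_t/t$ is non-increasing in $t$ and $\mu_m$ is non-increasing in $m$. Finally, $\rho \le \sqrt{2}$ since $\tau_{m+1} \le 2\tau_m$.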
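These observations can be checked on a concrete schedule. The sketch below assumes the doubling schedule $\tau_m = 2^m$ and the forms $d_t = \ln(16t^2|\Pi|(d+1)/\delta)$ and $\mu_m = \min\{1/(2K), \sqrt{d_{\tau_m}/(K\tau_m)}\}$; the parameter values are hypothetical.

```python
import math


def d_t(t, n_policies, dim, delta):
    """Assumed form d_t = ln(16 t^2 |Pi| (d + 1) / delta)."""
    return math.log(16.0 * t * t * n_policies * (dim + 1) / delta)


def mu(tau, k_arms, n_policies, dim, delta):
    """Assumed form mu_m = min{1/(2K), sqrt(d_tau / (K tau))}."""
    return min(1.0 / (2 * k_arms),
               math.sqrt(d_t(tau, n_policies, dim, delta) / (k_arms * tau)))


# Doubling schedule tau_m = 2^m, which satisfies tau_m < tau_{m+1} <= 2 tau_m.
taus = [2 ** m for m in range(1, 16)]
mus = [mu(t, k_arms=5, n_policies=100, dim=3, delta=0.05) for t in taus]
rho = max(math.sqrt(b / a) for a, b in zip(taus, taus[1:]))
```

On this schedule the early epochs are capped at $1/(2K)$, after which $\mu_m$ decays roughly like $\sqrt{\ln(\tau_m)/\tau_m}$, and $\rho$ equals $\sqrt{2}$ exactly.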

Finally, recall that Algorithm 1 consists of two phases. The first phase consists of pure exploration for $T_0 = 12K\ln\big(\frac{(d+1)|\Pi|}{\delta}\big)\frac{T}{B}$ steps to estimate $Z$ (see Appendix B), followed by a second phase that explores adaptively. The total regret of Algorithm 1 is the sum of the regret in the two phases. Most of this appendix is devoted to the regret analysis of the second phase. Note that the number of time steps in the second phase is $T' = T - T_0$. For simplicity, we use $T$ instead of $T'$ in the proofs below. Since $T_0 = o(T)$, this changes regret bounds by at most a constant factor.


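D.1 Technical Lemmas

Definition 2 (Variance estimates). Define the following for any probability distributions $P, Q \in \mathcal{C}(\Pi)$, any policy $\pi \in \Pi$, and $\mu \in [0, 1/K]$:

\[ \mathrm{Var}(Q, \pi, \mu) := E_{x\sim\mathcal{D}_X}\Big[\frac{1}{Q^{\mu}(\pi(x)|x)}\Big], \]
\[ \widehat{\mathrm{Var}}_m(Q, \pi, \mu) := \hat E_{x\sim H_{\tau_m}}\Big[\frac{1}{Q^{\mu}(\pi(x)|x)}\Big], \]
\[ \mathrm{Var}(Q, P, \mu) := E_{\pi\sim P}[\mathrm{Var}(Q, \pi, \mu)], \]
\[ \widehat{\mathrm{Var}}_m(Q, P, \mu) := E_{\pi\sim P}[\widehat{\mathrm{Var}}_m(Q, \pi, \mu)], \]

where $\hat E_{x\sim H_{\tau_m}}$ denotes the average over records in history $H_{\tau_m}$.

Furthermore, let $m(t) := \min\{m \in \mathbb{N} : t \le \tau_m\}$ be the index of the epoch containing round $t$, and define

\[ \mathcal{V}_t(P) := \max_{0 \le m < m(t)}\{\mathrm{Var}(Q_m, P, \mu_m)\}, \]

for all $t \in \mathbb{N}$ and $P \in \mathcal{C}(\Pi)$.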

Definition 3. Define $\mathcal{E}$ as the event that the following statements hold:

• For all probability distributions $P, Q \in \mathcal{C}(\Pi)$ and all $m \ge m_0$,

\[ \mathrm{Var}(Q, P, \mu_m) \le 6.4\,\widehat{\mathrm{Var}}_m(Q, P, \mu_m) + 81.3K, \tag{21} \]

• For all $P \in \mathcal{C}(\Pi)$, all epochs $m$ and all rounds $t$ in epoch $m$, and any choices of $\lambda_{m-1} \in [0, \mu_{m-1}]$,

\[ |\hat R_t(P) - R(P)| \le \mathcal{V}_t(P)\lambda_{m-1} + \frac{d_t}{t\lambda_{m-1}}, \tag{22a} \]
\[ \|\hat V_t(P) - V(P)\|_\infty \le \mathcal{V}_t(P)\lambda_{m-1} + \frac{d_t}{t\lambda_{m-1}}. \tag{22b} \]


Lemma 19. $\Pr(\mathcal{E}) \ge 1 - (\delta/2)$.

Proof. Lemma 10 in Agarwal et al. [2014] can be readily applied to show that, with probability $1 - \delta/4$,

\[ \mathrm{Var}(Q, \pi, \mu_m) \le 6.4\,\widehat{\mathrm{Var}}_m(Q, \pi, \mu_m) + 81.3K \]

for all $Q \in \mathcal{C}(\Pi)$ and $\pi \in \Pi$. Now, taking expectations on both sides over $\pi \sim P$, we get the first condition.

For the second condition, the proof is similar to the proof of Lemma 11 in Agarwal et al. [2014], but with some changes to account for distributions over policies. Fix component $j$ of the consumption vector, policy $\pi \in \Pi$ and time $t \in [T]$. Then,




1 


1 1 

< - < 


where 1/ := Vi(7r(xi))j - Vi(7r(xi))j. 

Round i is in epoch m(-i) < m, so 

' ~ ~ ~ Pm-1 ^ 

by definition of fictitious reward vector Vt. Furthermore, E[l)|Pt_i] = 0 and 

— —15 Mm(2) —l) 

from the definition of fictitious reward and of 'Vai{Q, tt, ^). 

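Let $U(\pi) := \frac{1}{t}\sum_{i=1}^{t}\mathrm{Var}(Q_{m(i)-1}, \pi, \mu_{m(i)-1})$. Then, by Freedman's inequality (Lemma 7) and a union bound applied to the sum $\frac{1}{t}\sum_{i=1}^{t}\big(\hat v_i(\pi(x_i))_j - v_i(\pi(x_i))_j\big)$ and its negation, we have that with probability at least $1 - 2\delta/(16t^2(d+1)|\Pi|)$, for all $\lambda_{m-1} \in [0, \mu_{m-1}]$,

\[ \frac{1}{t}\sum_{i=1}^{t}\big(\hat v_i(\pi(x_i))_j - v_i(\pi(x_i))_j\big) \le (e-2)U(\pi)\lambda_{m-1} + \frac{\ln(16t^2(d+1)|\Pi|/\delta)}{t\lambda_{m-1}}, \]
\[ -\frac{1}{t}\sum_{i=1}^{t}\big(\hat v_i(\pi(x_i))_j - v_i(\pi(x_i))_j\big) \le (e-2)U(\pi)\lambda_{m-1} + \frac{\ln(16t^2(d+1)|\Pi|/\delta)}{t\lambda_{m-1}}. \]

Taking a union bound over all choices of $t \le T$ and $\pi \in \Pi$, we have that, with probability at least $1 - \frac{\delta}{4(d+1)}$, for all $\pi$ and $t$,

\[ \hat V_t(\pi)_j - V(\pi)_j \le (e-2)U(\pi)\lambda_{m-1} + \frac{d_t}{t\lambda_{m-1}}, \quad \text{and} \tag{23} \]
\[ V(\pi)_j - \hat V_t(\pi)_j \le (e-2)U(\pi)\lambda_{m-1} + \frac{d_t}{t\lambda_{m-1}}. \tag{24} \]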
(24) 
Note that

\[ E_{\pi\sim P}[U(\pi)] = \frac{1}{t}\sum_{i=1}^{t} E_{\pi\sim P}[\mathrm{Var}(Q_{m(i)-1}, \pi, \mu_{m(i)-1})] \]
\[ = \frac{1}{t}\sum_{i=1}^{t}\mathrm{Var}(Q_{m(i)-1}, P, \mu_{m(i)-1}) \]
\[ \le \frac{1}{t}\sum_{i=1}^{t}\mathcal{V}_t(P) = \mathcal{V}_t(P), \]

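by definition of $\mathrm{Var}(Q, P, \mu)$ and of $\mathcal{V}_t(P)$. Also, by definition, $\mathbb{E}_{\pi\sim P}[\hat{\mathbf{V}}_t(\pi)_j] = \hat{\mathbf{V}}_t(P)_j$ and $\mathbb{E}_{\pi\sim P}[\mathbf{V}(\pi)_j] = \mathbf{V}(P)_j$. Therefore, taking expectation with respect to $\pi \sim P$ on both sides of Equations (23) and (24), we get that, with probability $1 - \frac{\delta}{4(d+1)}$, for all $P \in \mathcal{C}(\Pi)$,

$$\hat{\mathbf{V}}_t(P)_j - \mathbf{V}(P)_j \le (e-2)\,\mathcal{V}_t(P)\lambda_{m-1} + \frac{d_t}{t\lambda_{m-1}}, \text{ and} \tag{25a}$$

$$\mathbf{V}(P)_j - \hat{\mathbf{V}}_t(P)_j \le (e-2)\,\mathcal{V}_t(P)\lambda_{m-1} + \frac{d_t}{t\lambda_{m-1}}. \tag{25b}$$

Note that Equation (22a) for rewards can be similarly proved to hold with probability $1 - \frac{\delta}{4(d+1)}$. Taking a union bound over the reward and the $d$ dimensions of the consumption vector, we have that Equation (22) holds for all $t$ and all $P \in \mathcal{C}(\Pi)$ with probability $1 - \delta/4$. □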
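The union-bound arithmetic in this proof can be checked numerically. The following is a minimal sketch, assuming the per-(round, policy) failure probability is $2\delta/(16t^2(d+1)|\Pi|)$ as in the Freedman step; the function name and the sample parameter values are illustrative and do not come from the paper.

```python
def union_bound_failure(delta: float, d: int, num_policies: int, horizon: int) -> float:
    """Total failure probability for one fixed component j.

    Sums the assumed per-(t, pi) failure probability
    2*delta / (16 * t^2 * (d+1) * |Pi|) over all policies and all rounds t <= horizon.
    """
    total = 0.0
    for t in range(1, horizon + 1):
        per_pair = 2.0 * delta / (16.0 * t * t * (d + 1) * num_policies)
        total += num_policies * per_pair  # union over the |Pi| policies
    return total


# Illustrative values (hypothetical, chosen only for the check).
delta, d, num_policies, horizon = 0.1, 3, 50, 10_000
per_component = union_bound_failure(delta, d, num_policies, horizon)
all_components = (d + 1) * per_component  # reward plus d consumption dimensions
```

Because $\sum_t 1/t^2 \le \pi^2/6 < 2$, the per-component total stays below $\delta/(4(d+1))$, and the union over the $d+1$ components stays below $\delta/4$, which is the budget the proof assigns to the second condition.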


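Lemma 20. Assume event $\mathcal{E}$ holds. Then for all $m \le m_0$, and all rounds $t$ in epoch $m$,

$$|\hat{R}_t(P) - R(P)| \le \max\left\{\sqrt{\frac{4 d_t \mathcal{V}_t(P)}{t}},\ \frac{4K d_t}{t}\right\}, \tag{26a}$$

$$\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\|_\infty \le \max\left\{\sqrt{\frac{4 d_t \mathcal{V}_t(P)}{t}},\ \frac{4K d_t}{t}\right\}. \tag{26b}$$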


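Proof. We only prove the second inequality, as the first may be thought of as a one-dimensional special case of the second. By definition of $m_0$, for all $m' < m_0$, we have $\mu_{m'} = 1/(2K)$. Therefore $\mu_{m-1} = 1/(2K)$. First consider the case when $\sqrt{d_t/(t\,\mathcal{V}_t(P))} < \mu_{m-1} = 1/(2K)$. Then, substituting $\lambda_{m-1} = \sqrt{d_t/(t\,\mathcal{V}_t(P))}$ in (22b), we get

$$\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\|_\infty \le \sqrt{\frac{4 d_t \mathcal{V}_t(P)}{t}}.$$

Otherwise,

$$\mathcal{V}_t(P) \le \frac{4K^2 d_t}{t}.$$

Substituting $\lambda_{m-1} = \mu_{m-1} = 1/(2K)$, we get

$$\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\|_\infty \le \frac{\mathcal{V}_t(P)}{2K} + \frac{2K d_t}{t} \le \frac{4K d_t}{t}.$$

□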
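The two-case choice of $\lambda_{m-1}$ in this proof amounts to minimising $\mathcal{V}\lambda + d_t/(t\lambda)$ over $\lambda \in (0, 1/(2K)]$. The sketch below, with hypothetical function names and illustrative inputs, evaluates that minimisation and compares it with the maximum of the two case bounds derived in the proof.

```python
import math


def deviation_bound(v_t: float, d_t: float, t: int, K: int) -> float:
    """Evaluate v_t*lam + d_t/(t*lam) at the best lam in (0, 1/(2K)].

    The unconstrained minimiser is sqrt(d_t / (t * v_t)); if it exceeds
    the cap 1/(2K), the cap itself is used (second case of the proof).
    """
    lam_cap = 1.0 / (2 * K)
    lam_star = math.sqrt(d_t / (t * v_t))
    lam = min(lam_star, lam_cap)
    return v_t * lam + d_t / (t * lam)


def two_case_bound(v_t: float, d_t: float, t: int, K: int) -> float:
    """max{ sqrt(4 d_t v_t / t), 4 K d_t / t }, the bound obtained from the two cases."""
    return max(math.sqrt(4.0 * d_t * v_t / t), 4.0 * K * d_t / t)


# Illustrative grid (hypothetical values): the optimised deviation never exceeds the bound.
checks = [
    (v, d_t, t, K)
    for K in (2, 5)
    for v in (1.0, 3.0, 2.0 * K)
    for d_t in (5.0, 20.0)
    for t in (10, 1_000, 100_000)
]
all_hold = all(
    deviation_bound(v, d_t, t, K) <= two_case_bound(v, d_t, t, K) + 1e-12
    for (v, d_t, t, K) in checks
)
```

In the first case the optimised value equals $2\sqrt{\mathcal{V} d_t/t}$ exactly; in the second it is at most $4Kd_t/t$ because $\mathcal{V} \le 4K^2 d_t/t$.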


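Lemma 21. Assume event $\mathcal{E}$ holds. Then, for all $m$, all $t$ in epoch $m$, and all choices of distributions $P \in \mathcal{C}(\Pi)$,

$$|\hat{R}_t(P) - R(P)| \le \begin{cases} \max\left\{\sqrt{\dfrac{4 d_t \mathcal{V}_t(P)}{t}},\ \dfrac{4K d_t}{t}\right\}, & m \le m_0, \\[2ex] \mathcal{V}_t(P)\mu_{m-1} + \dfrac{d_t}{t\mu_{m-1}}, & m > m_0, \end{cases} \tag{27a}$$

$$\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\|_\infty \le \begin{cases} \max\left\{\sqrt{\dfrac{4 d_t \mathcal{V}_t(P)}{t}},\ \dfrac{4K d_t}{t}\right\}, & m \le m_0, \\[2ex] \mathcal{V}_t(P)\mu_{m-1} + \dfrac{d_t}{t\mu_{m-1}}, & m > m_0. \end{cases} \tag{27b}$$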
Proof. Follows from the definition of event $\mathcal{E}$ and Lemma 20. □

Lemma 22. Assume event $\mathcal{E}$ holds. Then for $t \ge t_0$ in epoch $m_0$,

$$|\hat{R}_t(P) - R(P)| \le \sqrt{\frac{8K d_t}{t}}, \tag{28a}$$

$$\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\|_\infty \le \sqrt{\frac{8K d_t}{t}}. \tag{28b}$$

Proof. Follows from Lemma 21 using that $\mathcal{V}_t \le 2K$, and $4K d_t/t \le 1$ for $t \ge t_0$ in epoch $m_0$. □


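Lemma 23. Assume event $\mathcal{E}$ holds. For any round $t \in [T]$, and any policy $P \in \mathcal{C}(\Pi)$, let $m \in \mathbb{N}$ be the epoch achieving the max in the definition of $\mathcal{V}_t(P)$. Then,

$$\mathcal{V}_t(P) \le \begin{cases} 2K, & \text{if } \mu_m = \dfrac{1}{2K}, \\[2ex] \theta_1 K + \dfrac{\widehat{\mathrm{Reg}}_{\tau_m}(P)}{\theta_2 \mu_m}, & \text{if } \mu_m < \dfrac{1}{2K}, \end{cases}$$

where $\theta_1 = 94.1$ and $\theta_2 = \psi/6.4 = 100/6.4$ are universal constants.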

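Proof. Fix a round $t$ and a policy distribution $P \in \mathcal{C}(\Pi)$. Let $m < m(t)$ be the epoch achieving the max in the definition of $\mathcal{V}_t(P)$ (Definition 2), so $\mathcal{V}_t(P) = \mathrm{Var}(\tilde{Q}_m, P, \mu_m)$. If $\mu_m = 1/(2K)$, this immediately implies $\mathcal{V}_t(P) \le 2K$ by definition.

If $\mu_m < 1/(2K)$, then $m \ge m_0$ (since $\mu_{m'} = 1/(2K)$ for all $m' < m_0$), and we have

$$\begin{aligned}
\mathrm{Var}(\tilde{Q}_m, P, \mu_m) &\le 6.4\,\widehat{\mathrm{Var}}_m(\tilde{Q}_m, P, \mu_m) + 81.3K \\
&\le 6.4\,\widehat{\mathrm{Var}}_m(Q_m, P, \mu_m) + 81.3K \\
&\le 6.4\left(2K + \frac{\widehat{\mathrm{Reg}}_{\tau_m}(P)}{\psi\mu_m}\right) + 81.3K \\
&= \theta_1 K + \frac{\widehat{\mathrm{Reg}}_{\tau_m}(P)}{\theta_2\mu_m},
\end{aligned}$$

where the first step is from Equation (21) (which holds in event $\mathcal{E}$); the second step is from the observation that $\tilde{Q}_m(\pi) \ge Q_m(\pi)$ for all $\pi \in \Pi$; the third step is from the constraint in (OP) that $Q_m$ satisfies; and the last step follows from the universal constants $\theta_1$ and $\theta_2$ defined earlier. □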


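Lemma 24. Assume event $\mathcal{E}$ holds. Define $c_0 := 4\rho(1 + \theta_1)$. For all epochs $m \ge m_0$, all rounds $t \ge t_0$ in epoch $m$, and all policies $P \in \mathcal{C}(\Pi)$,

$$\mathrm{Reg}(P) \le 2\widehat{\mathrm{Reg}}_t(P) + c_0 K \mu_m,$$
$$\widehat{\mathrm{Reg}}_t(P) \le 2\mathrm{Reg}(P) + c_0 K \mu_m,$$

for $\mathrm{Reg}(P)$, $\widehat{\mathrm{Reg}}_t(P)$ as defined in Section 3.

Proof. The proof is by induction on $m$. For the base case, consider $m = m_0$ and $t \ge t_0$ in epoch $m$. For all $P \in \mathcal{C}(\Pi)$,

$$\begin{aligned}
(Z+1)\big(\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)\big) ={}& \hat{R}_t(P_t) - \hat{R}_t(P) - R(P') + R(P) \\
& - Z\big(\phi(\hat{\mathbf{V}}_t(P_t), B') - \phi(\hat{\mathbf{V}}_t(P), B') + \phi(\mathbf{V}(P), B')\big).
\end{aligned} \tag{29}$$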
W.l.o.g., we can assume that $B \ge 2\big(T_0 + 2c\sqrt{KT\ln(T|\Pi|/\delta)}\big)$, because otherwise $B = O\big(\sqrt{KT\ln(dT|\Pi|/\delta)}\big)$ and the regret bound of Theorem 1 is trivial. Under this assumption, $B \ge 2(B - B')$, so that $B' \ge B/2$. Also, observe that since $B \ge B'$, $\mathrm{OPT}(B) \ge \mathrm{OPT}(B')$. Then, by Lemma 2 and the choice of $Z$ as specified by Lemma 3, we have that for any $\gamma \ge 0$,

$$\mathrm{OPT}(B' + \gamma) \le \mathrm{OPT}(B') + \frac{Z}{2}\gamma. \tag{30}$$

Now, since $P'$ is the optimal policy for budget $B'$, we obtain that $R(P') = \mathrm{OPT}(B')$. Also, by definition of $\phi(\mathbf{V}(P_t), B')$, $P_t$ can violate any budget constraint by at most $\phi(\mathbf{V}(P_t), B')$, which gives $R(P_t) \le \mathrm{OPT}(B' + \phi(\mathbf{V}(P_t), B'))$. Therefore, using (30) with $\gamma = \phi(\mathbf{V}(P_t), B')$,

$$R(P') \ge R(P_t) - \frac{Z}{2}\phi(\mathbf{V}(P_t), B') \ge R(P_t) - Z\phi(\mathbf{V}(P_t), B').$$

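Substituting in (29), we get

$$\begin{aligned}
(Z+1)\big(\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)\big) \le{}& \hat{R}_t(P_t) - \hat{R}_t(P) - R(P_t) + Z\phi(\mathbf{V}(P_t), B') + R(P) \\
& - Z\big(\phi(\hat{\mathbf{V}}_t(P_t), B') - \phi(\hat{\mathbf{V}}_t(P), B') + \phi(\mathbf{V}(P), B')\big) \\
\le{}& |\hat{R}_t(P_t) - R(P_t)| + |\hat{R}_t(P) - R(P)| \\
& + Z\|\hat{\mathbf{V}}_t(P_t) - \mathbf{V}(P_t)\|_\infty + Z\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\|_\infty.
\end{aligned} \tag{31}$$

For the other side, by definition of $P_t$, we have that $\hat{R}_t(P_t) - Z\phi(\hat{\mathbf{V}}_t(P_t), B') \ge \hat{R}_t(P) - Z\phi(\hat{\mathbf{V}}_t(P), B')$ for any $P \in \mathcal{C}(\Pi)$. Substituting in (29), and using that $\phi(\mathbf{V}(P'), B') = 0$, we get

$$\begin{aligned}
(Z+1)\big(\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)\big) \ge{}& \hat{R}_t(P') - \hat{R}_t(P) - R(P') + R(P) \\
& - Z\big(\phi(\hat{\mathbf{V}}_t(P'), B') - \phi(\hat{\mathbf{V}}_t(P), B') + \phi(\mathbf{V}(P), B')\big) \\
\ge{}& -|\hat{R}_t(P') - R(P')| - |\hat{R}_t(P) - R(P)| \\
& - Z\|\hat{\mathbf{V}}_t(P') - \mathbf{V}(P')\|_\infty - Z\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\|_\infty.
\end{aligned}$$

Therefore,

$$\begin{aligned}
(Z+1)\big(\mathrm{Reg}(P) - \widehat{\mathrm{Reg}}_t(P)\big) \le{}& |\hat{R}_t(P') - R(P')| + |\hat{R}_t(P) - R(P)| \\
& + Z\|\hat{\mathbf{V}}_t(P') - \mathbf{V}(P')\|_\infty + Z\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\|_\infty.
\end{aligned} \tag{32}$$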


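Substituting bounds from Lemma 22, we obtain

$$|\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)| \le 2\sqrt{\frac{8K d_t}{t}} \le c_0 K \mu_{m_0},$$

for $c_0 \ge 4\sqrt{2}$. The base case then follows from the non-negativity of $\widehat{\mathrm{Reg}}_t(P)$ and $\mathrm{Reg}(P)$.

Now, fix some epoch $m > m_0$. We assume as the inductive hypothesis that for all epochs $m' < m$, all rounds $t'$ in epoch $m'$, and all $P \in \mathcal{C}(\Pi)$,

$$\mathrm{Reg}(P) \le 2\widehat{\mathrm{Reg}}_{t'}(P) + c_0 K \mu_{m'},$$
$$\widehat{\mathrm{Reg}}_{t'}(P) \le 2\mathrm{Reg}(P) + c_0 K \mu_{m'}.$$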
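Fix a round $t$ in epoch $m$ and policy $P \in \mathcal{C}(\Pi)$. Using Equation (32) and Equation (22) (which holds under event $\mathcal{E}$),

$$\begin{aligned}
\mathrm{Reg}(P) - \widehat{\mathrm{Reg}}_t(P) \le{}& \frac{1}{Z+1}\Big(|\hat{R}_t(P') - R(P')| + |\hat{R}_t(P) - R(P)| \\
& \qquad + Z\|\hat{\mathbf{V}}_t(P') - \mathbf{V}(P')\|_\infty + Z\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\|_\infty\Big) \\
\le{}& \big(\mathcal{V}_t(P) + \mathcal{V}_t(P')\big)\mu_{m-1} + \frac{2d_t}{t\mu_{m-1}}.
\end{aligned} \tag{33}$$

Similarly, using Equation (31),

$$\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P) \le \big(\mathcal{V}_t(P_t) + \mathcal{V}_t(P)\big)\mu_{m-1} + \frac{2d_t}{t\mu_{m-1}}.$$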

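By Lemma 23, there exist epochs $m', m'' < m$ such that

$$\mathcal{V}_t(P) \le \theta_1 K + \frac{\widehat{\mathrm{Reg}}_{\tau_{m'}}(P)}{\theta_2\mu_{m'}} \cdot \mathbb{1}\left\{\mu_{m'} < \frac{1}{2K}\right\},$$

$$\mathcal{V}_t(P') \le \theta_1 K + \frac{\widehat{\mathrm{Reg}}_{\tau_{m''}}(P')}{\theta_2\mu_{m''}} \cdot \mathbb{1}\left\{\mu_{m''} < \frac{1}{2K}\right\}.$$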

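If $\mu_{m'} < 1/(2K)$, then $m_0 \le m' \le m - 1$, and the inductive hypothesis implies

$$\frac{\widehat{\mathrm{Reg}}_{\tau_{m'}}(P)}{\theta_2\mu_{m'}} \le \frac{2\mathrm{Reg}(P) + c_0 K\mu_{m'}}{\theta_2\mu_{m'}} = \frac{c_0 K}{\theta_2} + \frac{2\mathrm{Reg}(P)}{\theta_2\mu_{m'}} \le \frac{c_0 K}{\theta_2} + \frac{2\mathrm{Reg}(P)}{\theta_2\mu_{m-1}}, \tag{34}$$

where the last step uses the fact that $\mu_{m'} \ge \mu_{m-1}$ for $m' \le m - 1$. Therefore, no matter whether $\mu_{m'} < 1/(2K)$ or not, we always have

$$\mathcal{V}_t(P)\mu_{m-1} \le \left(\theta_1 + \frac{c_0}{\theta_2}\right)K\mu_{m-1} + \frac{2}{\theta_2}\mathrm{Reg}(P). \tag{35}$$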


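If $\mu_{m''} < 1/(2K)$, then $m_0 \le m'' \le m - 1$, and the inductive hypothesis implies

$$\frac{\widehat{\mathrm{Reg}}_{\tau_{m''}}(P')}{\theta_2\mu_{m''}} \le \frac{2\mathrm{Reg}(P') + c_0 K\mu_{m''}}{\theta_2\mu_{m''}} = \frac{c_0 K}{\theta_2},$$

where the last step uses the fact that $\mathrm{Reg}(P') = 0$. Therefore, no matter whether $\mu_{m''} < 1/(2K)$ or not, we always have

$$\mathcal{V}_t(P')\mu_{m-1} \le \left(\theta_1 + \frac{c_0}{\theta_2}\right)K\mu_{m-1}. \tag{36}$$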


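Combining Equations (33), (35) and (36) gives

$$\mathrm{Reg}(P) \le \frac{1}{1 - \frac{2}{\theta_2}}\left(\widehat{\mathrm{Reg}}_t(P) + 2\left(\theta_1 + \frac{c_0}{\theta_2}\right)K\mu_{m-1} + \frac{2d_t}{t\mu_{m-1}}\right). \tag{37}$$

Since $m > m_0$, the definition of $\rho$ ensures that $\mu_{m-1} \le \rho\mu_m$. Also, since $t > \tau_{m-1}$, $\frac{d_t}{t\mu_{m-1}} \le \frac{K\mu_{m-1}^2}{\mu_{m-1}} \le \rho K\mu_m$. Applying these inequalities and the facts $c_0 = 4\rho(1 + \theta_1)$ and $\theta_2 \ge 8\rho$ in Equation (37), we have thus proved

$$\mathrm{Reg}(P) \le 2\widehat{\mathrm{Reg}}_t(P) + c_0 K\mu_m. \tag{38}$$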


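The other part can be proved similarly. By Lemma 23, there exists an epoch $m'' < m$ such that

$$\mathcal{V}_t(P_t) \le \theta_1 K + \frac{\widehat{\mathrm{Reg}}_{\tau_{m''}}(P_t)}{\theta_2\mu_{m''}} \cdot \mathbb{1}\left\{\mu_{m''} < \frac{1}{2K}\right\}.$$

If $\mu_{m''} < 1/(2K)$, then $m_0 \le m'' \le m - 1$, and the inductive hypothesis together with Equation (38) imply

$$\frac{\widehat{\mathrm{Reg}}_{\tau_{m''}}(P_t)}{\theta_2\mu_{m''}} \le \frac{2\mathrm{Reg}(P_t) + c_0 K\mu_{m''}}{\theta_2\mu_{m''}} \le \frac{2\big(2\widehat{\mathrm{Reg}}_t(P_t) + c_0 K\mu_{m''}\big) + c_0 K\mu_{m''}}{\theta_2\mu_{m''}}.$$

Since $\widehat{\mathrm{Reg}}_t(P_t) = 0$ by definition, the above upper bound is simplified to

$$\frac{\widehat{\mathrm{Reg}}_{\tau_{m''}}(P_t)}{\theta_2\mu_{m''}} \le \frac{3c_0 K\mu_{m''}}{\theta_2\mu_{m''}} = \frac{3c_0 K}{\theta_2}.$$

Therefore, no matter whether $\mu_{m''} < 1/(2K)$ or not, we always have

$$\mathcal{V}_t(P_t)\mu_{m-1} \le \left(\theta_1 + \frac{3c_0}{\theta_2}\right)K\mu_{m-1}. \tag{39}$$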

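Combining Equations (34), (35) and (39) gives

$$\widehat{\mathrm{Reg}}_t(P) \le \left(1 + \frac{2}{\theta_2}\right)\mathrm{Reg}(P) + 2\left(\theta_1 + \frac{2c_0}{\theta_2}\right)K\mu_{m-1} + \frac{2d_t}{t\mu_{m-1}}. \tag{40}$$

Since $m > m_0$, the definition of $\rho$ ensures that $\mu_{m-1} \le \rho\mu_m$. Also, since $t > \tau_{m-1}$, $\frac{d_t}{t\mu_{m-1}} \le \rho K\mu_m$. Applying these inequalities and the facts $c_0 = 4\rho(1 + \theta_1)$ and $\theta_2 \ge 8\rho$ in Equation (40), we have thus proved the second part in the inductive statement:

$$\widehat{\mathrm{Reg}}_t(P) \le 2\mathrm{Reg}(P) + c_0 K\mu_m,$$

and hence the whole lemma. □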
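The constant bookkeeping in the induction can be sanity-checked numerically. The sketch below assumes the forms $\mathrm{Reg} \le \frac{1}{1-2/\theta_2}\big(\widehat{\mathrm{Reg}}_t + 2(\theta_1 + c_0/\theta_2)K\mu_{m-1} + 2d_t/(t\mu_{m-1})\big)$ and $\widehat{\mathrm{Reg}}_t \le (1 + 2/\theta_2)\mathrm{Reg} + 2(\theta_1 + 2c_0/\theta_2)K\mu_{m-1} + 2d_t/(t\mu_{m-1})$ for (37) and (40), together with $\mu_{m-1} \le \rho\mu_m$ and $d_t/(t\mu_{m-1}) \le \rho K\mu_m$. The function name is hypothetical.

```python
THETA_1 = 6.4 * 2 + 81.3  # 94.1, from the last step in the proof of Lemma 23
PSI = 100.0
THETA_2 = PSI / 6.4  # 15.625


def induction_coefficients(rho: float):
    """Return ((regret multiplier, additive coefficient / c0), ...) for (37) and (40).

    Each additive coefficient multiplies K * mu_m after substituting
    mu_{m-1} <= rho * mu_m and d_t / (t * mu_{m-1}) <= rho * K * mu_m.
    """
    c0 = 4.0 * rho * (1.0 + THETA_1)
    scale = 1.0 / (1.0 - 2.0 / THETA_2)
    first_additive = scale * (2.0 * rho * (THETA_1 + c0 / THETA_2) + 2.0 * rho)
    second_additive = 2.0 * rho * (THETA_1 + 2.0 * c0 / THETA_2) + 2.0 * rho
    first = (scale, first_additive / c0)
    second = (1.0 + 2.0 / THETA_2, second_additive / c0)
    return first, second


# The induction needs multipliers <= 2 and additive ratios <= 1 whenever theta_2 >= 8 * rho.
results = [induction_coefficients(rho) for rho in (1.0, 1.5, THETA_2 / 8.0)]
```

For the second inequality the additive ratio equals $1/2 + 4\rho/\theta_2$, so the condition $\theta_2 \ge 8\rho$ is exactly what keeps it at most 1.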

D.2 Main Proof 

We are now ready to prove Theorem 1. By Lemma 19, event $\mathcal{E}$ holds with probability at least $1 - \delta/2$. Hence, it suffices to prove the regret upper bound whenever $\mathcal{E}$ holds.

Recall from the description of Algorithm 1 in Section 3 that the algorithm samples the action $a_t$ taken at time $t$ in epoch $m$ from the smoothed projection $\tilde{Q}_t^{\mu_{m-1}}$ of $\tilde{Q}_t$, where $\tilde{Q}_t$ is constructed by assigning all the remaining weight from $Q_{m-1}$ to $P_t$. From the discussion in Appendix C, we can represent $\tilde{Q}_t$ as a linear combination of $P \in \mathcal{C}(\Pi)$ as follows:

$$\tilde{Q}_t = \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t)P = \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q_{m-1})P + \Big(1 - \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q_{m-1})\Big)P_t.$$

Let $\mu_t = \mu_{m(t)-1}$, where $m(t)$ denotes the epoch in which time step $t$ lies: $m(t) = m$ for $t \in [\tau_{m-1} + 1, \tau_m]$.

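$$\begin{aligned}
R(P^*) - \frac{1}{T}\sum_t R(\tilde{Q}_t) &= \frac{1}{T}\sum_t \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t)\big(R(P^*) - R(P)\big) \\
&= \frac{1}{T}\sum_t \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t)\big(R(P') - R(P)\big) + \big(R(P^*) - R(P')\big) \\
&= \frac{1}{T}\sum_t \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t)\big((Z+1)\mathrm{Reg}(P) - Z\phi(\mathbf{V}(P), B')\big) + \big(R(P^*) - R(P')\big) \\
&\le \frac{Z+1}{T}\sum_t \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t)\mathrm{Reg}(P) + \big(R(P^*) - R(P')\big).
\end{aligned} \tag{41}$$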
The last inequality simply follows from the non-negativity of the function $\phi(\cdot, \cdot)$. Now, by the observation in Lemma 2 and using $B' \ge B/2$, $B' = B - T_0 - c\sqrt{KT\ln(T|\Pi|/\delta)}$,


«(P-) - «(P') < ^ 

To bound the first term in (41), note that for $m \le m_0$, $\mu_{m-1} = 1/(2K)$. So, trivially, for $t$ in epoch $m \le m_0$,

$$\sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t)\mathrm{Reg}(P) \le c_0 K\psi\mu_{m-1}. \tag{42}$$


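Suppose $\mathcal{E}$ holds. Then, Lemma 24 implies that for all epochs $m \ge m_0$, all rounds $t \ge t_0$ in epoch $m$, and all policies $P \in \mathcal{C}(\Pi)$, we have

$$\mathrm{Reg}(P) \le 2\widehat{\mathrm{Reg}}_t(P) + c_0 K\mu_m.$$

Therefore, for $t$ in such epochs $m$, using the first condition in OP (from Section 3), we get

$$\begin{aligned}
\sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t)\mathrm{Reg}(P) &\le \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t)\big(2\widehat{\mathrm{Reg}}_t(P) + c_0 K\psi\mu_{m-1}\big) \\
&= \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q_{m-1}) \cdot 2\widehat{\mathrm{Reg}}_t(P) + c_0 K\psi\mu_{m-1} \\
&\le (c_0 + 2)K\psi\mu_{m-1}.
\end{aligned} \tag{43}$$

The equality above holds because, by definition, $\tilde{Q}_t$ assigns the remaining weight from $Q_{m-1}$ to $P_t$, and $\widehat{\mathrm{Reg}}_t(P_t) = 0$. Substituting in Equation (41), we get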


^ (Z + l)iT^(co + 2)^ ,, _ , , 2C-OPT 

^ —l(Pm Pm— l) “b jj \l rji in(T|II|/o) . 


T 


B M T 


(44) 


Applying an upper bound (Lemma 16 of Agarwal et al. (2014)) on the sum over $m$ above gives

) 


V- . N / 16T2(d+l)|n| /T, 64P2(d+i)|n|\ 

y^^m-i(rm - Pm-i) < 4 ( In-^-b-) 


Substituting these bounds, and using Z < 24^^ + 8 from Lemma|3] we get 


1 V- N ^/OPT/ IK^ dT\Il\ /Id dr|n|\ 


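Next, we show that $\frac{1}{T}\sum_t r_t(a_t)$ is close to $\frac{1}{T}\sum_t R(\tilde{Q}_t)$. Recall that the algorithm samples $a_t$ from $\tilde{Q}_t^{\mu_t}$. Define the random variable at step $t$ by

$$Y_t := r_t(a_t) - \Big(\sum_{\pi \in \Pi} (1 - K\mu_t)\tilde{Q}_t(\pi)\,r_t(\pi(x_t)) + \mu_t\sum_a r_t(a)\Big).$$

It is easy to see $\mathbb{E}[Y_t \mid H_{t-1}] = 0$, so the Azuma-Hoeffding inequality for martingale sequences implies that, with probability at least $1 - \delta/2$,

$$\epsilon := \sqrt{\frac{1}{2T}\ln\frac{4}{\delta}} \ge \Big|\frac{1}{T}\sum_t Y_t\Big|.$$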
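The deviation radius in the Azuma-Hoeffding step can be checked with a short calculation. This is a minimal sketch assuming the Hoeffding-type form of the inequality for martingale differences that each lie in an interval of length at most 1, which gives the two-sided tail $2\exp(-2T\epsilon^2)$; the function names are illustrative.

```python
import math


def azuma_radius(T: int, delta: float) -> float:
    """Radius eps = sqrt(ln(4/delta) / (2T)) used for the averaged martingale sum."""
    return math.sqrt(math.log(4.0 / delta) / (2.0 * T))


def two_sided_tail(T: int, eps: float) -> float:
    """Assumed tail bound 2*exp(-2*T*eps^2) for |(1/T) * sum_t Y_t| >= eps."""
    return 2.0 * math.exp(-2.0 * T * eps * eps)


# Illustrative values: the radius is calibrated so that the tail equals delta/2.
T, delta = 10_000, 0.05
eps = azuma_radius(T, delta)
tail = two_sided_tail(T, eps)
```

Setting $2\exp(-2T\epsilon^2) = \delta/2$ and solving for $\epsilon$ recovers the stated radius, and the radius shrinks at rate $1/\sqrt{T}$.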
By definition of $Y_t$, we have with probability at least $1 - \delta/2$ that


t 


t 





( 46 ) 


which implies, together with the triangle inequality and Equation (l45l l. that (assuming E holds) with probability 1 — 


t 



IK^ T|n| 


-Ini 

T S 


K 

Y 


■ In 



(47) 


By Lemma [19] event £ holds with probability at least 1 — S/2. Therefore, by multiplying by T on both sides and 
adding Tq = Jn (an upper bound of cumulative regret incurred in the first Tq steps of Algorithm[T]i, we have 
that the algorithm will have a regret bounded by 


~ /OPT ;-TT 12KT d|n|\ 

o(^-^\/i^rin(|n|) + —^In^j 

with probability at least $1 - \delta$, which completes the proof of Theorem 1 if the algorithm never aborts due to constraint violation in Step 10. But, from Lemma 25, the event that the budget constraint is violated happens with probability at most $\delta/2$. Combining this with the bounds on reward given by (47), and that $\mathcal{E}$ holds with probability $1 - \delta/2$, we
obtain that the regret bound in Theorem[T]holds with probability 1 — ^. 

Lemma 25. With probability at least $1 - \delta/2$, the algorithm is not aborted in Step 10 due to budget violation.

Proof. The proof involves showing that with high probability, the algorithm's consumption over $B'$, in steps $t = 1, \ldots, T - T_0$, is bounded above by $c\sqrt{KT\ln(T|\Pi|/\delta)}$ for a large enough universal constant $c$. And, since $B' + T_0 + c\sqrt{KT\ln(T|\Pi|/\delta)} = B$, we obtain that the algorithm will satisfy the knapsack constraint with high probability. This also explains why we started with a smaller budget.

More precisely, we show that assuming $\mathcal{E}$ holds, in every epoch $m$, for every $t$ in epoch $m$,

$$\phi(\mathbf{V}(\tilde{Q}_t), B') \le 4(c_0 + 2)K\psi\mu_{m-1}. \tag{48}$$


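Recall that $\phi(\mathbf{V}(P), B')$ was defined as the maximum violation of budget $B'$ by vector $\mathbf{V}(P)$. To prove the above, we observe that our choice of $Z$ ensures that $\phi(\mathbf{V}(P), B')$ is bounded by $\mathrm{Reg}(P)$ as follows. By Equation (30), for all $P \in \mathcal{C}(\Pi)$,

$$R(P') \ge R(P) - \frac{Z}{2}\phi(\mathbf{V}(P), B'),$$

so that

$$(Z+1)\mathrm{Reg}(P) = R(P') - R(P) + Z\phi(\mathbf{V}(P), B') \ge \frac{Z}{2}\phi(\mathbf{V}(P), B').$$

Summing over $P \in \mathcal{C}(\Pi)$, with weights $\alpha_P(\tilde{Q}_t)$, and using $Z \ge 1$,

$$\sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t)\phi(\mathbf{V}(P), B') \le 4\sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t)\mathrm{Reg}(P).$$

Now, $\phi(\cdot, B')$ is a convex function; therefore, applying Jensen's inequality,

$$\phi(\mathbf{V}(\tilde{Q}_t), B') \le 4\sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t)\mathrm{Reg}(P).$$

Substituting from Equations (42) and (43), we obtain the bound in Equation (48). Averaging (48) over all $t$ and using Jensen's inequality,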

E ^ ^ E 73') < ^ E 4(co+ m/’Pn.it) 

t t t 
The sum on the right hand side can be bounded using (45):




16{co + 2)K^ ( 


T 


T , 64r2|n| \ 


Also, we can use arguments similar to those used for deriving ( |46] ) to obtain that for every i = 1,... ,d, with 


probability at least 1 — 


2d’ 




'T 


(49) 


t=i 


where e = \l 


Using these bounds along with Equation ( l45l l. we get that with probability 1 — 




< O 


K T|n| 


Id K ^ 

— In — H-In 

T S T 


r|n|y 


Therefore, for large enough constant c, and large enough T > max{iT, d}, 

and by definition of B'), this implies that with probability 1 — f, for all j = 1,..., d, 

E'"*(“*)f < -S' + c\jKT In 

Therefore, algorithm will not exceed B = B' + c-\/iTTln(T|n|/(5) with probability 1 — f assuming £ holds. □ 


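E Regret Analysis for Section 5: CBwR

The analysis is structurally similar to that in Appendix D. Here, we only describe the differences and omit most of the identical steps.

The first difference is in the definition of regrets, which have been defined in Section 5: for $P \in \mathcal{C}(\Pi)$,

$$\mathrm{Reg}(P) = \frac{1}{\|\mathbf{1}_d\| L}\big(f(\mathbf{V}(P^*)) - f(\mathbf{V}(P))\big),$$

$$P_t = \arg\max_{P \in \mathcal{C}_0(\Pi)} f(\hat{\mathbf{V}}_t(P)),$$

$$\widehat{\mathrm{Reg}}_t(P) = \frac{1}{\|\mathbf{1}_d\| L}\big(f(\hat{\mathbf{V}}_t(P_t)) - f(\hat{\mathbf{V}}_t(P))\big).$$

Other convenience quantities ($d_t$, $m_0$, $t_0$, and $\rho$) are defined in the same way as in Appendix D, except that the factor $d + 1$ is replaced by $d$ in $d_t$.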
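Definition 4 (Variance estimates). Define the following for any probability distributions $P, Q \in \mathcal{C}(\Pi)$, any policy $\pi \in \Pi$, and $\mu \in [0, 1/K]$:

$$V(Q, \pi, \mu) := \mathbb{E}_{x \sim \mathcal{D}_X}\left[\frac{1}{Q^{\mu}(\pi(x) \mid x)}\right],$$

$$\hat{V}_m(Q, \pi, \mu) := \hat{\mathbb{E}}_{x \sim H_{\tau_m}}\left[\frac{1}{Q^{\mu}(\pi(x) \mid x)}\right],$$

$$V(Q, P, \mu) := \mathbb{E}_{\pi \sim P}\big[V(Q, \pi, \mu)\big],$$

$$\hat{V}_m(Q, P, \mu) := \mathbb{E}_{\pi \sim P}\big[\hat{V}_m(Q, \pi, \mu)\big],$$

where $\hat{\mathbb{E}}_{x \sim H_{\tau_m}}$ denotes the average over records in history $H_{\tau_m}$.

Furthermore, let $m(t) := \min\{m \in \mathbb{N} : t \le \tau_m\}$ be the index of the epoch containing round $t$, and define

$$\mathcal{V}_t(P) := \max_{m < m(t)} \{V(\tilde{Q}_m, P, \mu_m)\},$$

for all $t \in \mathbb{N}$ and $P \in \mathcal{C}(\Pi)$.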
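To make the variance estimates concrete, the sketch below computes the empirical quantity $\hat{V}_m(Q, \pi, \mu)$ on a toy problem. It assumes the smoothed projection has the form $Q^{\mu}(a \mid x) = (1 - K\mu)\sum_{\pi : \pi(x) = a} Q(\pi) + \mu$ with $Q$ a full probability distribution over a finite policy list, following the construction of Agarwal et al. (2014); the policies, contexts, and function names are all hypothetical.

```python
from typing import Callable, Sequence

Policy = Callable[[int], int]  # maps a context to an action in {0, ..., K-1}


def smoothed_prob(Q: Sequence[float], policies: Sequence[Policy],
                  x: int, a: int, mu: float, K: int) -> float:
    """Assumed smoothed projection Q^mu(a|x) = (1 - K*mu) * Q(a|x) + mu."""
    mass = sum(w for w, pi in zip(Q, policies) if pi(x) == a)
    return (1.0 - K * mu) * mass + mu


def empirical_variance(Q: Sequence[float], policies: Sequence[Policy],
                       target: Policy, contexts: Sequence[int],
                       mu: float, K: int) -> float:
    """Average of 1 / Q^mu(target(x)|x) over the recorded contexts."""
    total = sum(1.0 / smoothed_prob(Q, policies, x, target(x), mu, K) for x in contexts)
    return total / len(contexts)


# Toy setup: K = 2 actions, three policies, four recorded contexts.
K, mu = 2, 0.1
policies = [lambda x: 0, lambda x: 1, lambda x: x % 2]
contexts = [0, 1, 2, 3]

balanced = empirical_variance([0.5, 0.5, 0.0], policies, policies[2], contexts, mu, K)
uncovered = empirical_variance([1.0, 0.0, 0.0], policies, policies[1], contexts, mu, K)
```

With the balanced distribution every action has smoothed probability 0.5, so the estimate is 2. When $Q$ never plays the target's action, the smoothing floor $\mu$ caps the estimate at $1/\mu = 10$, which is the same floor that gives the $1/\mu_{m-1}$ bound on the fictitious rewards. The distribution-level quantity $\hat{V}_m(Q, P, \mu)$ is then the $P$-weighted average of these per-policy values.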

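Definition 5. Define $\mathcal{E}$ as the event that the following statements hold:
• For all probability distributions $P, Q \in \mathcal{C}(\Pi)$ and all $m \ge m_0$,

$$V(Q, P, \mu_m) \le 6.4\,\hat{V}_m(Q, P, \mu_m) + 81.3K. \tag{50}$$

• For all $P \in \mathcal{C}(\Pi)$, all epochs $m$ and all rounds $t$ in epoch $m$, any $\delta \in (0, 1)$, and any choices of $\lambda_{m-1} \in [0, \mu_{m-1}]$,

$$\|\mathbf{1}_d\|^{-1}\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\| \le \mathcal{V}_t(P)\lambda_{m-1} + \frac{d_t}{t\lambda_{m-1}}. \tag{51}$$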


Lemma 26. $\Pr(\mathcal{E}) \ge 1 - \delta/2$.

Proof. The proof is identical to that for Lemma 19 up to Equation (25), which gives concentration on a fixed dimension of the observation vector. Now, applying a union bound over all $d$ dimensions, we have that, with probability $1 - \delta/4$, for all $t$ and all $P \in \mathcal{C}(\Pi)$,

$$\|\mathbf{1}_d\|^{-1}\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\| \le (e-2)\,\mathcal{V}_t(P)\lambda_{m-1} + \frac{d_t}{t\lambda_{m-1}}.$$

□


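Lemma 27. Assume event $\mathcal{E}$ holds. Then for all $m \le m_0$, and all rounds $t$ in epoch $m$,

$$\|\mathbf{1}_d\|^{-1}\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\| \le \max\left\{\sqrt{\frac{4 d_t \mathcal{V}_t(P)}{t}},\ \frac{4K d_t}{t}\right\}. \tag{52}$$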


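Proof. By definition of $m_0$, for all $m' < m_0$, $\mu_{m'} = 1/(2K)$. Therefore, $\mu_{m-1} = 1/(2K)$. First consider the case when $\sqrt{d_t/(t\,\mathcal{V}_t)} < \mu_{m-1} = 1/(2K)$. Then, substitute $\lambda_{m-1} = \sqrt{d_t/(t\,\mathcal{V}_t)}$ to get

$$\|\mathbf{1}_d\|^{-1}\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\| \le \sqrt{\frac{4 d_t \mathcal{V}_t}{t}}.$$

Otherwise,

$$\mathcal{V}_t \le \frac{4K^2 d_t}{t}.$$

Substituting $\lambda_{m-1} = \mu_{m-1} = 1/(2K)$, we get

$$\|\mathbf{1}_d\|^{-1}\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\| \le \frac{4K d_t}{t}.$$

□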

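Lemma 28. Assume event $\mathcal{E}$ holds. Then, for all $m$, all $t$ in epoch $m$, and all choices of distributions $P \in \mathcal{C}(\Pi)$,

$$\|\mathbf{1}_d\|^{-1}\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\| \le \begin{cases} \max\left\{\sqrt{\dfrac{4 d_t \mathcal{V}_t(P)}{t}},\ \dfrac{4K d_t}{t}\right\}, & m \le m_0, \\[2ex] \mathcal{V}_t(P)\mu_{m-1} + \dfrac{d_t}{t\mu_{m-1}}, & m > m_0. \end{cases} \tag{53}$$

Proof. Follows from the definition of event $\mathcal{E}$ and Lemma 27. □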


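Lemma 29. Assume event $\mathcal{E}$ holds. For any round $t \in \mathbb{N}$, and any policy $P \in \mathcal{C}(\Pi)$, let $m \in \mathbb{N}$ be the epoch achieving the max in the definition of $\mathcal{V}_t(P)$. Then,

$$\mathcal{V}_t(P) \le \begin{cases} 2K, & \text{if } \mu_m = \dfrac{1}{2K}, \\[2ex] \theta_1 K + \dfrac{\widehat{\mathrm{Reg}}_{\tau_m}(P)}{\theta_2\mu_m}, & \text{if } \mu_m < \dfrac{1}{2K}, \end{cases}$$

where $\theta_1 = 94.1$ and $\theta_2 = \psi/6.4 = \cdots$ are universal constants.

Proof. Identical to that of Lemma 23. □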


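Lemma 30. Assume event $\mathcal{E}$ holds. Define $c_0 := 4\rho(1 + \theta_1)$. For all epochs $m \ge m_0$, all rounds $t \ge t_0$ in epoch $m$, and all policies $P \in \mathcal{C}(\Pi)$,

$$\mathrm{Reg}(P) \le 2\widehat{\mathrm{Reg}}_t(P) + c_0 K\mu_m,$$
$$\widehat{\mathrm{Reg}}_t(P) \le 2\mathrm{Reg}(P) + c_0 K\mu_m.$$

Proof. We start with two useful inequalities that show the closeness of $\mathrm{Reg}(P)$ and $\widehat{\mathrm{Reg}}_t(P)$. On one hand, using the triangle inequality, the $L$-smoothness of the reward function $f$, and the definition of $P^*$ (which gives $f(\mathbf{V}(P^*)) \ge f(\mathbf{V}(P_t))$), we have

$$\begin{aligned}
\|\mathbf{1}_d\|\big(\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)\big) &= \frac{1}{L}\big(f(\hat{\mathbf{V}}_t(P_t)) - f(\hat{\mathbf{V}}_t(P)) - f(\mathbf{V}(P^*)) + f(\mathbf{V}(P))\big) \\
&\le \frac{1}{L}\big(f(\mathbf{V}(P)) - f(\hat{\mathbf{V}}_t(P)) + f(\hat{\mathbf{V}}_t(P_t)) - f(\mathbf{V}(P_t))\big) \\
&\le \frac{1}{L}\big|f(\hat{\mathbf{V}}_t(P_t)) - f(\mathbf{V}(P_t))\big| + \frac{1}{L}\big|f(\mathbf{V}(P)) - f(\hat{\mathbf{V}}_t(P))\big| \\
&\le \|\mathbf{V}(P) - \hat{\mathbf{V}}_t(P)\| + \|\hat{\mathbf{V}}_t(P_t) - \mathbf{V}(P_t)\|.
\end{aligned} \tag{54}$$

Similarly, one can prove the opposite direction, using the definition of $P_t$ instead (which gives $f(\hat{\mathbf{V}}_t(P_t)) \ge f(\hat{\mathbf{V}}_t(P^*))$):

$$\begin{aligned}
\|\mathbf{1}_d\|\big(\mathrm{Reg}(P) - \widehat{\mathrm{Reg}}_t(P)\big) &= \frac{1}{L}\big(-f(\hat{\mathbf{V}}_t(P_t)) + f(\hat{\mathbf{V}}_t(P)) + f(\mathbf{V}(P^*)) - f(\mathbf{V}(P))\big) \\
&\le \frac{1}{L}\big(f(\hat{\mathbf{V}}_t(P)) - f(\mathbf{V}(P)) + f(\mathbf{V}(P^*)) - f(\hat{\mathbf{V}}_t(P^*))\big) \\
&\le \frac{1}{L}\big|f(\hat{\mathbf{V}}_t(P)) - f(\mathbf{V}(P))\big| + \frac{1}{L}\big|f(\mathbf{V}(P^*)) - f(\hat{\mathbf{V}}_t(P^*))\big| \\
&\le \|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\| + \|\mathbf{V}(P^*) - \hat{\mathbf{V}}_t(P^*)\|.
\end{aligned} \tag{55}$$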
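We now prove the lemma by mathematical induction on $m$. For the base case, we have $m = m_0$ and $t \ge t_0$ in epoch $m_0$. Then, from Lemma 28, using the facts that $\mathcal{V}_t \le 2K$, and that $4K d_t/t \le 1$ for $t \ge t_0$ in epoch $m_0$, we get, for all $P \in \mathcal{C}(\Pi)$, that

$$\|\mathbf{1}_d\|^{-1}\|\hat{\mathbf{V}}_t(P) - \mathbf{V}(P)\| \le \max\left\{\sqrt{\frac{4 d_t \mathcal{V}_t(P)}{t}},\ \frac{4K d_t}{t}\right\} \le \sqrt{\frac{8K d_t}{t}}.$$

Combining this with Equations (54) and (55), we prove the base case:

$$\big|\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P)\big| \le 2\sqrt{\frac{8K d_t}{t}} \le c_0 K\mu_{m_0}.$$

For the induction step, fix some epoch $m > m_0$ and assume for all epochs $m' < m$, all rounds $t' \ge t_0$ in epoch $m'$, and all distributions $P \in \mathcal{C}(\Pi)$ that

$$\mathrm{Reg}(P) \le 2\widehat{\mathrm{Reg}}_{t'}(P) + c_0 K\mu_{m'},$$
$$\widehat{\mathrm{Reg}}_{t'}(P) \le 2\mathrm{Reg}(P) + c_0 K\mu_{m'}.$$

Then, from Equations (54) and (55) as well as Lemma 28, we have the following inequalities:

$$\mathrm{Reg}(P) - \widehat{\mathrm{Reg}}_t(P) \le \big(\mathcal{V}_t(P) + \mathcal{V}_t(P^*)\big)\mu_{m-1} + \frac{2d_t}{t\mu_{m-1}},$$

$$\widehat{\mathrm{Reg}}_t(P) - \mathrm{Reg}(P) \le \big(\mathcal{V}_t(P) + \mathcal{V}_t(P_t)\big)\mu_{m-1} + \frac{2d_t}{t\mu_{m-1}},$$

which are the analogues of Equation (33) and the inequality following it in the proof of Lemma 24. The rest of the proof is the same. □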


E.l Main Proof 

We are now ready to prove Theorem [b]. By Lemma 26, event $\mathcal{E}$ holds with probability at least $1 - \delta/2$. Hence, it suffices to prove the regret upper bound whenever $\mathcal{E}$ holds.

Recall from Section 3 that the algorithm samples $a_t$ at time $t$ in epoch $m$ from the smoothed projection $\tilde{Q}_{m-1}^{\mu_{m-1}}$ of $\tilde{Q}_{m-1}$. Also, recall from Appendix C that $\tilde{Q}_m$ for any $m$ is represented as a linear combination of $P \in \mathcal{C}(\Pi)$ as follows:

$$\tilde{Q}_m = \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_m)P = \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q_m)P + \Big(1 - \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(Q_m)\Big)P_t$$

($\tilde{Q}_m$ assigns all the remaining weight from $Q_m$ to $P_t$).

Let $\tilde{Q}_t = \tilde{Q}_{m(t)-1}$ and $\mu_t = \mu_{m(t)-1}$, where $m(t)$ denotes the epoch in which time step $t$ lies: $m(t) = m$ for $t \in [\tau_{m-1} + 1, \tau_m]$. Then,

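$$\begin{aligned}
f(\mathbf{V}(P^*)) - f\Big(\frac{1}{T}\sum_t \mathbf{V}(\tilde{Q}_t)\Big) &\le f(\mathbf{V}(P^*)) - \frac{1}{T}\sum_t f(\mathbf{V}(\tilde{Q}_t)) \\
&\le \frac{1}{T}\sum_t \Big(f(\mathbf{V}(P^*)) - \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t) f(\mathbf{V}(P))\Big) \\
&= \frac{\|\mathbf{1}_d\| L}{T}\sum_t \sum_{P \in \mathcal{C}(\Pi)} \alpha_P(\tilde{Q}_t)\mathrm{Reg}(P),
\end{aligned}$$

where we have used Jensen's inequality twice.

With identical reasoning as in the proof in Appendix D.2, we can prove, using the above inequality, that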

/(V(P*)) - /(I ^ Y{Qt)) < + E Hm-l{rm - Tm-l) ■ (56) 

t m 
We next show that $f\big(\frac{1}{T}\sum_t \mathbf{V}(\tilde{Q}_t)\big)$ is close enough to the regret that we are interested in. Specifically, fix a component $i \in [d]$ and let $[v]_i$ be the $i$th component of vector $v$. Recall that the algorithm samples $a_t$ from $\tilde{Q}_t^{\mu_t}$. Define the random variable at step $t$ by

Zt := [vt(at)]i - ^(1 - K. 

ttGII a 


It is easy to see $\mathbb{E}[Z_t \mid H_{t-1}] = 0$, so the Azuma-Hoeffding inequality for martingale sequences implies that, with probability at least $1 - \delta/(2d)$,

$$\sqrt{\frac{1}{2T}\ln\frac{4d}{\delta}} \ge \Big|\frac{1}{T}\sum_t Z_t\Big|.$$

Applying a union bound over $i \in [d]$, we have with probability at least $1 - \delta/2$ that


t t 

which implies, together with the L-smoothness of /, that 


(57) 


Combining ( l56b and ( l58l l, we get 


fi /E E ^ (^+f E 




< Ud\\L + - T^-l) + ^ ■ 

t \ m j 

Applying the same upper bound for 'Ylim P‘m-\{xm — An-i) as in AppendixlDl we get 


(58) 


(59) 




8d. 


T„. {T)Tm{T) 

K 


(60) 


Now substituting the same bounds for and as well as the value of e, one gets the final regret upper bound, 

as stated in the theorem: 


-regret(T) = /(v(P*))-/(iE^*(«‘)) 


avg 


T 


< ||ld||iV'(4co + 16) I — In ^ 


K , 16T2|n| ^ jK , 64T2|n|^ 


In ■ 

T 6 


+ \\U\L 


1 , Ad 
— In — 
2T 8 


= O \\U\L 


K,T\n\ , .ATE , r|n|\ 


In ■ 

T 8 


In — H-In ■ ^ , 

T 8 T 8 1 


Note that a regret bound of the above order is trivial unless $T \ge K\ln(T|\Pi|/\delta)$. Making that assumption, we get the following bound in a simpler form:


avg-regret(r) 